5
BIOLOGY
Sequencing the Genome
Just as personal digital technologies have caused
economic, social and scientific revolutions unimagined
when we had our first few computers, we must expect and
prepare for similar changes as we move forward from our
first few genomes.
IN DECEMBER 2010 in Milwaukee, Wisconsin, Nicholas Volker, a five-year-old boy with a gastrointestinal condition that had not previously been seen, who had undergone over a hundred surgical operations and was almost constantly hospitalized and intermittently septic, was virtually on death’s door. But when his DNA sequence was determined, his doctors found the culprit mutation. That discovery led to the proper treatment, and now Nicholas is healthy and thriving. Even though this was only the first clearly documented case of the life-saving power of human genomics in medicine, few could now deny that the field was going to have a vital role in the future of medicine. Some would argue that the treatment led to an even bigger breakthrough: health insurance coverage of sequencing costs for select cases.
2
It took the better part of a decade from the completion of the first draft of the Human Genome Project for genomics to reach the clinic in such a dramatic way. To make treatment like Volker’s common will likely take more time still. Even if that’s the ultimate prize, the creative destruction of medicine still has various other, less comprehensive, genomic tools for us to use, based on investigations of things like single-nucleotide polymorphisms, the exome, and more. The material can be a bit heady, but it’s worth pushing through: these tools could effect not just dramatic corrections of faulty genes but a better, more scientific understanding of disease susceptibility and what drugs to take. Moreover, as they empower patients and democratize medicine, they make medical knowledge available to all and deep knowledge of ourselves available to each of us. Nevertheless, at this level, perhaps more than anywhere else in this ongoing medical revolution, the resistance from the priesthood of medicine is at its height. The fight might be tougher than the material, but in neither case can we afford to give up.
GENOMICS 101
More than ten years ago, on June 26, 2000, a major ceremony was held at the White House to announce the first draft of the human genome sequence. President Bill Clinton said, “Today we are learning the language in which God created life. . . . It will revolutionize the diagnosis, prevention, and treatment of most, if not all human diseases.”
3 He called it the “most important, most wondrous map ever produced by humankind.” The New York Times headline was “Genetic Code of Human Life Is Cracked by Scientists,” and Time featured on its cover Dr. Francis Collins, who led the publicly funded NIH consortium, and Dr. Craig Venter, who led the private Celera Genomics effort, with the title “Cracking the Code!” The excitement was high, but even today, physicians and the public alike have yet to gain real knowledge about what was discovered, then and in the ensuing decade.
The human genome comprises twenty-three paired chromosomes with more than three billion bases in all, arrayed in a double helix. Because we are “diploid,” with two copies of each chromosome, one paternally derived and the other maternally derived, we have about six billion bases in our DNA. So while there are lots and lots of them, the DNA sequence per se is actually fairly straightforward. Each base can be one of only four molecules, the four “life codes”: adenine (A), guanine (G), cytosine (C), and thymine (T). Despite the simplicity of the code, it is difficult to read, and developing the first machines that could do it was expensive, complicated, and time consuming. The other fundamental problem facing genomics, besides the sheer volume of code, is understanding what it all actually means. It has been a “fact” of elementary biological education for some time that a DNA string of three bases codes for an amino acid and that a collection of amino acids (like histidine and tyrosine) is assembled to form a protein, or more simply that genes code for proteins. Many of us wish it were that simple!
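The textbook rule just described, three bases to one amino acid, can be sketched in a few lines of Python. This is only an illustration: the codon table below is a small excerpt of the standard genetic code, and the function name and example sequences are made up for the purpose.

```python
# Partial codon table (excerpt of the standard genetic code).
# Only the codons used in the examples below are listed.
CODON_TABLE = {
    "ATG": "Met",                                  # methionine (start)
    "TAT": "Tyr", "TAC": "Tyr",                    # tyrosine
    "CAT": "His", "CAC": "His",                    # histidine
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",   # stop signals
}


def translate(dna):
    """Read a coding DNA string three bases at a time (hypothetical helper)."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        # "?" marks a codon that is missing from this partial table
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "?")
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


print(translate("ATGTATTAA"))  # ['Met', 'Tyr']
print(translate("ATGCATTAA"))  # ['Met', 'His']
```

The two example strings differ by a single base (T versus C), which is enough to swap tyrosine for histidine in the resulting chain.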
It turns out that of the whole six billion bases, only 1.5 percent actually represent coding elements, known as exons of genes, that are capable of producing proteins. Collectively this portion of DNA is known as the exome. Before the Human Genome Project, the genomics community had expected there would be at least 100,000 genes in our DNA, and perhaps many more than that. Instead, the final gene count is fewer than 23,000 genes in the human genome. That means that approximately 98.5 percent of the genome does not code for proteins and does not fulfill the classic definition of a “gene.” If not coding for proteins, what is the vast majority of the genome doing?
The answer is regulating. Within genes there are promoters and introns, which also do not code for proteins. Promoters turn on or off the process of transcription, by which the DNA of the exome is copied into messenger RNA (mRNA), which itself gets read and translated into amino acids and proteins. Introns don’t code for proteins either. Although they are transcribed to precursors of mRNA, they are ultimately edited (spliced) out in the process of making mature mRNA and translating it.
The rest of the genome, outside of the confines of genes, is chock full of regulatory elements that modify the function of genes that may be located thousands or even millions of bases away. That regulation can be achieved in many different ways, such as coding for RNA transcripts and affecting how much of a protein is manufactured or how it is edited. In fact, so much of the genome codes for RNA transcripts with no direct protein-coding function that it has become clear that RNA (in such diverse forms as micro-RNAs, small interfering RNAs, long noncoding RNAs, and more) is the big story of the genome. There may be more than 100,000 of these “RNA-only” genes (which are conceptually distinct from the original principle of gene as protein producer), more than four times the number of “traditional” genes. In 2007, the Economist featured RNA on its cover with the story “Biology’s Big Bang: Unraveling the Secrets of RNA.”
4 This headline was an outgrowth of the large-scale, government-funded ENCODE project, a collaboration of thirty-five groups in eighty organizations around the world, which focused on developing an Encyclopedia of DNA Elements (a reach for an acronym).
Thus our sense of the operational aspects and function of the DNA sequence has been radically altered from the dogma of decades ago: far fewer genes than expected, outnumbered almost a hundred to one by the regulatory complex that controls them. The system seems to be just the opposite of our banking system and its regulation, in light of collateralized debt obligations and the financial meltdown. With our genome, governance is hyperemphasized. This positions the genome to make a person’s biologic “meltdown” most unlikely, although not impossible.
Our DNA sequence comes in two categories, or versions: our germ-line DNA, representing the master DNA of the egg and sperm that created each of us and led to the formation of the cells in our body, and the copied, derived DNA, known as somatic DNA, found in our cells. The somatic DNA can develop mutations as cells replicate, as anyone familiar with cancer would know. It turns out, however, that not all mutations associated with cancer or other diseases are the cause of the condition; those that are causal are known as “driver” mutations, while others, known as “passenger” mutations, simply get dragged along by virtue of being physically proximate to the cause of the disease. The complexity here is that if one is looking for something that has gone biologically awry at the DNA level, it is necessary to zoom in on the DNA from the relevant cells. Blood-derived (from white blood cells) or saliva DNA is typically assumed to be representative of germ-line DNA. But in a patient with a blood-borne malignancy, such as leukemia, this would not be the case.
For example, if we are trying to determine the root cause of a heart problem that was genetically based, we might look at the DNA from both the germ-line and somatic heart cells. In cancer, one might directly compare the DNA sequence from the tumor somatic cells to the germ-line sequence, an approach known as paired sequencing.
5 This gets even more complex because in any tumor, there is heterogeneity of the DNA sequence, such that cells in different parts of the cancer tissue may have a different set of mutations. In another example of a brain disorder that may be genetically based, one would have to look at both the germ-line and brain tissue to determine whether any somatic mutations might account for the condition. To compound the complexity, genomic regulation varies at the local level. Micro-RNAs, RNA transcripts, small RNAs, gene expression, and epigenomics (discussed below) all exhibit tissue and cell-specific effects. Not really Genomics 101!
What’s more, the human genome varies from person to person. For any particular base in the genome there might be a difference in one person compared to another. The simplest and most common form of variation is known as a single nucleotide polymorphism (SNP).
6 Each variant is known as an allele. Each time another person’s DNA is fully sequenced, hundreds of thousands of new SNPs are discovered. Genomic scientists around the world are asked to deposit any validated SNP (replicated and cross-checked for accuracy) into a publicly accessible database called dbSNP.
The large international project known as the HapMap (short for “haplotype map”) was set up to determine the common human SNPs. In this project, 269 individuals representative of three major ancestries (African, Asian, and European) had their DNA genotyped to determine common SNPs, defined as occurring in more than 5 percent of the population.
7
A haplotype is a key unit of the genome, a “bin” or string of bases that has been inherited as a unit. The theory behind the HapMap was that if one could identify common SNPs, this would serve as a locator or zip code directory for the genome.
8 The bin or block has SNPs or alleles that are in “linkage disequilibrium” (LD), a somewhat misleading term that actually denotes that they are linked together. So if you have information about a zip code of the genome, tagged by a common SNP, you could explore which haplotype (bin) is associated with any particular condition of interest (like eye color, height, or a disease). In the average individual of European or Asian ancestry, there are over 500,000 LD bins or haplotypes. With African ancestry there is much more diversity, reflecting that Africa is where human beings originated and have evolved for the longest period, so that there are approximately one million zip codes. Thus, in order to “tag” the human genome, you would need to genotype at least 500,000 common SNPs, assuming you had at least one tag per zip code.
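To make “linked together” a little more concrete, here is a minimal Python sketch of one common way geneticists score linkage disequilibrium between two SNPs, the r² statistic. The choice of r² is an assumption for illustration (the text does not name a measure), and the function name and haplotype lists are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Score LD between two SNPs with the r-squared statistic.

    haplotypes: list of 2-character strings, the allele at SNP 1
    followed by the allele at SNP 2 (hypothetical input format).
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]      # arbitrary reference alleles
    p_a = sum(h[0] == a for h in haplotypes) / n   # frequency of allele a at SNP 1
    p_b = sum(h[1] == b for h in haplotypes) / n   # frequency of allele b at SNP 2
    p_ab = sum(h == a + b for h in haplotypes) / n # frequency of the a-b haplotype
    d = p_ab - p_a * p_b                           # departure from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0


# Two SNPs that always travel together: one can "tag" the other.
linked = ["AG"] * 6 + ["CT"] * 4
# Two SNPs whose alleles are shuffled independently: no tagging possible.
unlinked = ["AG", "AT", "CG", "CT"]

print(round(ld_r_squared(linked), 3))    # approximately 1.0
print(round(ld_r_squared(unlinked), 3))  # 0.0
```

A value near 1 means that knowing one SNP tells you the other, which is what lets a single tag SNP stand in for a whole bin.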
SNPs are not the only form of architectural or so-called structural variation in the genome.
9 There are about one-tenth as many insertions or deletions, known collectively as “indels,” as there are SNPs.
10 (See Figure 5.1.) This simply means that a base or series of bases has been added or deleted (for example, in the figure, out of the four bases, G, A, and T are deleted).
11 There can be block substitutions or inversions, as depicted. An important type of structural variant is known as copy number variation (CNV). Here quite a large block of the genome, which can even represent millions of bases, can be duplicated or missing in one individual versus another. The location of each of these variants could be determined relative to the original T substitution for a C.
FIGURE 5.1: Types of variation that occur in the human genome. The bottom panel for each category provides an example of that variation compared with the top sequence of single nucleotide variant. Source: K. Frazer, “Human Genetic Variation and Its Contribution to Complex Traits,” Nature Reviews Genetics 10 (2009): 241–51.
A PEEK INTO THE GENOME (GWAS)
One important application of haplotyping is in genome-wide association studies (GWAS). These studies became possible because of jumps in the technology of genotyping. In 1997, we could only genotype one SNP at a time; by 2007 we could genotype one million SNPs in an individual using chips and automated robotic systems. This genotyping is more than 99.99 percent accurate in correctly identifying the base (A, C, T or G) at a specific location. This is a remarkable feat of the technology, considering that the machines are handling a million SNPs, and it can be checked via other platforms that are designed to assess small numbers of genotypes with the highest level of accuracy. Coupling this technology with the zip code (HapMap) approach ushered in the GWAS era.
The first GWAS was published in April 2005.
12 That study was investigating age-related macular degeneration (AMD), the most common cause of blindness, which affects more than seven million Americans. The researchers examined 116,204 SNP tags in the haplotypes of ninety-six patients who had AMD and fifty suitable controls who were not affected. They found that variation in one specific zip code, a haplotype in a gene known as complement factor H (CFH) located on chromosome 1, was associated with more than sevenfold higher risk of being affected by AMD during one’s lifetime.
13 Sequencing this bin of the genome pinpointed the variation in an exon—a simple change coding for the amino acid histidine instead of tyrosine—as a root cause of the disease. Of note, three independently conducted and reported studies replicated these findings.
14
It was a stunning success for genomic science, with a limited number of patients and a minimal number of tag SNPs (at least 500,000 had been thought necessary for a cohort of European ancestry). The 700 percent increased risk of AMD disease susceptibility for the common SNP was striking, and a functional, culprit SNP was pinpointed by sequencing the incriminated genomic locus. Until this point, all we knew was that patients with the disease had inflammation of their retinal tissue, but there are thousands of genes that might have been responsible for such inflammation. A common, complex serious condition had been cracked by GWAS!
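For readers curious how a “fold higher risk” figure can emerge from a case-control comparison, the sketch below computes an odds ratio, a measure commonly used in such studies. It is an assumption that this is the relevant measure here, since the text speaks only of risk; the counts are invented for illustration and are not the macular degeneration study’s data.

```python
def odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio from a 2x2 table of allele counts (hypothetical helper).

    case_risk / case_other:       risk vs. other alleles among patients
    control_risk / control_other: risk vs. other alleles among controls
    """
    return (case_risk * control_other) / (case_other * control_risk)


# Invented counts, for illustration only.
print(odds_ratio(case_risk=60, case_other=40,
                 control_risk=30, control_other=70))  # 3.5
```

An odds ratio of 1 would mean the variant is equally common in patients and controls; the further above 1, the stronger the association.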
The term “complex” here is an important one. Before genomics, the only genetic diseases we had cracked followed simple Mendelian inheritance and expression patterns. They could be autosomal dominant (meaning only one copy on chromosomes 1 through 22 was necessary), autosomal recessive (requiring two copies for the disease to manifest), sex-linked (on either the X or the Y chromosomes), or mitochondrial (found in the power plants of our cells, which have their own DNA, reflecting their bacterial origins). These diseases are rare in the population and comparatively simple to understand at the genetic level. Examples are cystic fibrosis, Huntington’s disease, and Tay-Sachs disease; these and more than 2,000 others have been catalogued by the Online Mendelian Inheritance in Man database.
Besides being rare or common, there are critical differences between Mendelian diseases and complex traits. For Mendelian diseases, typically a mutation or different mutations in a single gene accounts for the disease. If a person has the mutation, it is highly likely that the disease will develop—a so-called highly penetrant mutation and a more deterministic story. On the other hand, each complex disease is caused by many different genes. These diseases do not follow a classical Mendelian inheritance pattern from one generation to another, and the variants responsible for them each have relatively low penetrance; their pattern of expression is probabilistic. Thus, in the complicated case of macular degeneration, we can only talk about the identified haplotype being associated with a higher likelihood, instead of a certainty, of developing the disease.
Although the macular degeneration study became a cause célèbre for the genomics community, there are two clear caveats lest we get too carried away. The first is simply that it was something of a lucky strike. Hundreds of subsequent GWAS of complex, polygenic diseases later, it has become clear that, with a few notable exceptions, lone haplotypes rarely signify such elevated risk for a specific disease. The second is that even one million SNPs represents only 0.03 percent of the genome, so this is simply a peek—even when we apply the haplotype approach. Nevertheless, GWAS has provided a veritable avalanche of data, the likes of which we have never seen in the history of genetics. In the early phase of GWAS reports, my colleagues and I published “The Genomics Gold Rush,” about the unprecedented chain of discoveries and excitement in the field.
15 This era was defined by a major publication and discovery of disease susceptibility genes nearly every week. The papers started to appear in mid-2007 and have continued to appear in Nature, Science, Nature Genetics, and the New England Journal of Medicine (four of the highest impact journals in biomedical research) almost every week since.
Figure 5.2 shows more than 1,200 GWAS studies with significant results published in leading peer-reviewed journals; the genomic underpinnings for more than two hundred complex traits (predominantly diseases) have been mapped to the chromosome (and specific zip code) region.
16
FIGURE 5.2: Output of the genome wide association studies (GWAS) from 2005 to 2010. The complex traits’ “zip codes” are shown schematically by chromosome and location on each chromosome. Source: L. A. Hindorff et al., “A Catalog of Published Genome-Wide Association Studies,” Office of Population Genomics, National Human Genome Research Institute, National Institutes of Health, n.d., www.genome.gov/gwastudies.
GWAS represents a unique type of science that is hypothesis-free. Rather than postulating that a particular gene or panel of gene “candidates” is related to a disease of interest, GWAS represents an unbiased approach that lets the genomes of people “talk.” To start the hunt, there are no candidates, no nominations; if the results hold up, certain regions of the genome are elected and statistically incontrovertibly associated with the condition of interest.
Hypothesis-free research has been stunningly effective. Most of the genomic regions that have been discovered had never previously been theorized to have anything to do with the disease, such as the CFH gene in macular degeneration, the FTO gene for obesity, the TCF7L2 gene for diabetes, and a long list of other examples.
17 GWAS has also revealed that multiple genes in a particular pathway may be involved. For example, Crohn’s disease, a debilitating disease of the small intestine, can be the result of problems with “autophagy,” or the self-digestion of our own cells.
18 Now through amassing serial, multiple large cohorts of individuals with and without Crohn’s disease, with cumulatively over 15,000 cases and 14,000 controls without the condition, more than seventy susceptibility loci have been identified.
19 Only a few of these loci are related to the autophagy defect, reinforcing that there are clearly many molecular forms of what has been diagnosed as the same disease.
For so-called Type 2 diabetes mellitus, genes implicate several different pathways.
20 There could be problems with the production of insulin, the secretion of insulin, the transport of insulin, or the reception of insulin. GWAS may ultimately help us more precisely determine the molecular basis of the disease in a particular person. Hypothesis-free GWAS has also revealed that the same genes may be implicated in multiple diseases. In Type 1 diabetes, thought to be an autoimmune process, nineteen of the first twenty-six genes GWAS linked to this disease were immune regulation genes.
21 Surprisingly, gene variants indicating diabetes susceptibility were also implicated in prostate cancer,
22 for example, as well as in multiple other cancers and multiple other autoimmune diseases.
Nevertheless, GWAS has many deficiencies. For example, over 80 percent of the genome loci incriminated are not in exons. And simply knowing the zip code that is implicated is far from pinpointing the mechanism for the disease susceptibility. Furthermore, the macular degeneration case notwithstanding, the actual culprit or functional SNP variant(s) has been left undefined in almost all diseases. Most of the zip codes are only denoting a small risk, typically in the range of 10 to 20 percent, and thus are not particularly strong signals. Importantly, association with a disease is a different relationship from predicting a disease. The susceptibility figures are derived from large populations, so trying to use the data for any particular individual is a jump in extrapolation. Virtually all of the GWAS represent a snapshot in time, a binary yes or no of cases versus controls, such that any putative risk is for the lifetime of an individual rather than an assertion that something might happen at a particular age. Beyond these many uncertainties, there are tens of SNP variants that have been associated with each of a variety of complex traits: more than seventy for Crohn’s disease, more than a hundred fifty for height, more than fifty for Type 2 diabetes, and so on.
23
Most of these variants have a statistical association that holds up under multiple comparisons: the probability that the findings are random artifacts rather than true findings is on the order of 1 in 100 million (this is known as the P-value of a finding). The clinical value, however, is much more suspect. It is even more complicated since we do not know in most cases if the gene variants interact, such that there might be an additive or multiplicative effect (not to ignore the possibility of subtractive interactions) if someone carries two SNPs of different implicated genes. This phenomenon of gene-gene interaction is known as epistasis and represents a gaping hole in our understanding of the dynamics of the human genome. This notion—that it is rare for an isolated gene to be the sole major player in the output of a series of genes—is the underpinning of systems biology, which studies the interactions of networks of genes. These networks are a complicated affair; they may at times require perturbations in multiple elements of a pathway or a “node” in the pathway for the outcome to be meaningfully different.
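Why must the odds of a chance finding be so tiny? One standard way to handle multiple comparisons is the Bonferroni correction, which divides the usual cutoff by the number of tests. The Python sketch below assumes that approach and a conventional 0.05 cutoff, neither of which is specified in the text; the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the overall false-alarm rate near alpha."""
    return alpha / n_tests


def is_genome_wide_significant(p_value, n_tests, alpha=0.05):
    """True if a single SNP's P-value survives the correction."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# One million SNPs tested at once (assumed chip size, as in the text).
n_snps = 1_000_000

# A finding at 1 in 100 million clears the bar...
print(is_genome_wide_significant(1e-8, n_snps))  # True
# ...while a classic "significant" P-value of 0.01 does not come close.
print(is_genome_wide_significant(0.01, n_snps))  # False
```

With a million tests, the corrected cutoff is about 5 in 100 million, which is why only extremely small P-values are taken seriously in GWAS.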
Moreover, in multiple studies comparing GWAS with traditional prediction by analyzing family histories of diseases, GWAS has provided little information that we didn’t already know.
24 Early studies in coronary artery disease and atrial fibrillation, in particular, highlighted this disappointment. Good old family history generally provided as much predictive capacity as genotyping in these conditions. Once again, this emphasizes the differences in advancing our understanding of the biology of diseases compared to prediction power of a “heredity horoscope.”
25
The difficulty of using GWAS data to try to predict susceptibility is in part related to the problem that the findings have only explained a small portion of heritability of the disease of interest. Heritability is usually defined by studying the differences between identical (monozygotic) and fraternal (dizygotic) twins. Most common diseases have an important heritable component, but GWAS has typically only explained about 10 percent of the story.
The unexplained 90 percent has become known as “missing heritability”
26 or, borrowing a term from cosmology, the “dark matter” of the genome. The prevailing thought before GWAS and the HapMap was that common variants would explain common disease. The macular degeneration study was an early fluke: because it accounted for most of the disease’s heritability, it was the source of undue confidence. But if we define “common” to be occurrence at the 5 percent frequency level, SNPs do not largely explain the heritability of common diseases.
TEN YEARS LATER: THE ANNIVERSARY OF THE FIRST HUMAN GENOME SEQUENCE
Given the difficulties facing GWAS, it might not be surprising that in stark contrast to the exuberant coverage of the first draft human genome sequence in June 2000, the media handling of its ten-year anniversary in June 2010 was distinctly negative and sobering. The headline of the New York Times was “A Decade Later, Gene Map Yields Few New Cures; Medical Uses Limited; Despite Early Promise, Diseases’ Roots Prove Hard to Find.”
27 In its own editorial “The Genome, 10 Years Later,” the Times wrote: “The task facing science and the industry in coming decades is at least as challenging as the original deciphering of the human genome.”
28 In Scientific American, Stephen Hall’s article “Revolution Postponed” stated that “the Human Genome Project has failed so far to produce the medical miracles that scientists promised.”
29 Fast Company’s feature “The Gene Bubble” opened with the following: “When the human genome was first sequenced nearly a decade ago, the world lit up with talk about how new gene-specific drugs would help us cheat death. Well the verdict is in: keep eating those greens.”
30 Victor McElheny’s book Drawing the Map of Life emphasized the “struggle for medical relevance.”
31 In the Wall Street Journal Matt Ridley opined in “The Failed Promise of Genomics” that “it’s a curious fact that genomics has always been sold as a medical story, yet it keeps under-delivering useful medical knowledge.”
32 USA Today was a bit more sanguine with “The Human Genome: Big Advances, Many Questions.”
33 Interestingly, the Economist had a distinctly upbeat interpretation of the decade after the Human Genome Project. In a special report entitled “Biology 2.0,” it proclaimed, “biological science is poised on the edge of something wonderful.” In response to widespread disappointment elsewhere, the Economist declared, “Genomics has not yet delivered the drugs, but it will.”
34
The article “Major Heart Disease Genes Prove Elusive,” published in Science, quoted me as saying, “At the end of the day, we have a bunch of loci and genes, but none of them do all that much to raise the risk of heart disease. Nor have they yet altered our understanding of how the heart falters—knowledge that will take time to develop.” Another cardiologist researcher from the Gladstone Institute, Deepak Srivastava, was even more pessimistic: “People did studies with 300 or 500 people and didn’t find anything, then did 1000 and didn’t find anything . . . in retrospect, GWAS wasn’t worth the expenditure.”
35
But virtually all of these articles missed a major wave of GWAS discoveries with the potential for immediate impact on the practice of medicine. These were not disease susceptibility findings but GWAS probes of which genes were responsible for interacting with prescription medications: the field of pharmacogenomics.
PHARMACOGENOMICS
Hepatitis C is one of the most significant global health problems, affecting about 3 percent of the world’s population, or approximately two hundred million people.
36 It is the leading cause of cirrhosis and cancer of the liver. The standard treatment for hepatitis C is to eradicate the virus by treatment with PEG-interferon-α, combined with ribavirin, and given for a year. Treatment in the United States costs $50,000. It makes almost everyone who takes it ill, with symptoms of a flu-like condition. What’s worst about it, however, is that the drug only works in 50 percent of the people who take it, being generally more effective in those of European as compared with African ancestry.
In 2009, three different research groups did a GWAS to see if the drug response could be predicted—and it could.
37 A major signal was found at the gene IL28B with a single nucleotide variant that accounted for a twofold likelihood of a therapeutic response.
38 It also explained the disparity between ancestries, and the action of the drug fit precisely with a protein coded by the IL28B gene, also known as interferon-λ3, which attacks pathogens. It was striking how few patients were needed to generate this discovery: in one study, in Japan, GWAS assessed only sixty-four responders and seventy-eight nonresponders!
39 Genotyping IL28B SNPs could, at least theoretically, be immediately used to help predict which patients were likely to respond to conventional therapy. Between some newly approved drugs and more than twenty other drugs being actively pursued for hepatitis C therapy, there are many other choices for those individuals who are not likely to respond to PEG-interferon-α.
Actionable Information
The PEG-interferon-α example, unlike that of macular degeneration, turned out to be quite representative of the fates of many other pharmacogenomic genome-wide (genome peek) studies that followed. One involved Plavix, the trade name for clopidogrel, the second-best-selling prescription drug in the world with over $9 billion in sales in 2010. It blocks a platelet receptor known as P2Y12, which is an important mediator of the aggregation of platelets in clots. For many years, however, medical professionals were aware that the drug had quite variable effects in patients. Plavix worked well for some but in others had almost no effect.
As was well known at the time, the drug must be metabolized by the liver before it becomes biologically active. In 2006, a study in healthy volunteers showed that variants in the gene CYP2C19, which was involved in activating Plavix in the liver, played a role in the inconsistent effects of the drug.
40 A couple of years later multiple groups published definitive evidence that among patients who were having a stent placed in their coronary artery, the risk of the stent clotting was increased threefold if the patient carried a loss-of-function variant of CYP2C19.
41 And it was duly noted that these loss-of-function variants, particularly one known as the *2 allele, were extremely common—found in more than 30 percent of individuals of European ancestry, 40 percent of individuals of African ancestry, and about 50 percent of those of Asian ancestry.
42
Stents are the stress test for a platelet blood clot, since this foreign body of metal placed in the artery attracts platelets on its surface. Patients are routinely given aspirin and Plavix together for several months, or even years, to prevent a clot in the stent. Most stent clots result in sudden death or a heart attack, so they can be considered a real catastrophe. Fortunately, although two million coronary stenting procedures are performed worldwide each year, stent clotting occurs only in 1 to 2 percent of patients.
43 But they are the same people who are much more apt to have the CYP2C19 variants that cannot normally metabolize or activate Plavix. And 1 percent of two million people represent a lot of heart attacks or fatalities. So the stakes are quite high with Plavix pharmacogenomics, even if it does not involve hundreds of millions of people like hepatitis C.
Before a GWAS was performed, the investigators of Plavix response had been using a candidate gene approach, genotyping people for CYP2C19 because they considered it a likely explanation of some of the heterogeneity of Plavix response. But in 2009, Shuldiner and colleagues published a Plavix GWAS.
44 Their paper included a Manhattan plot (see Figure 5.3) on which the x-axis represents the position on all the chromosomes and the y-axis represents the P-values (actually the −log10 P, if one wants to get technical) of more than 300,000 SNPs throughout the genome. The phenotype (the clinical feature assessed) was the extent of platelet aggregation inhibited in more than four hundred individuals who had been treated with Plavix for a week. Such graphs are called Manhattan plots because they look something like the skyline of New York City. In this case, the GWAS revealed a lone “skyscraper” of many significant SNPs at a particular spot or locus in the genome, and it was in the CYP2C19 gene cluster.
45 This certainly didn’t account for all of the Plavix variable response, but it anchored the multiple studies that had preceded it as the most important common gene variant influencing Plavix response.
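To make the vertical axis of a Manhattan plot concrete, here is a minimal Python sketch. The SNP names and P-values are invented for illustration, and the cutoff (0.05 divided by the number of SNPs tested, a Bonferroni-style correction) is an assumption about how significance might be judged, not a detail taken from the Shuldiner paper.

```python
import math

# Hypothetical P-values for a handful of SNPs (illustrative only).
snp_p_values = {
    "snp_A": 0.42,
    "snp_B": 3.0e-4,
    "snp_C": 1.5e-13,  # stands in for a SNP inside the "skyscraper"
    "snp_D": 0.07,
}

N_SNPS_TESTED = 300_000  # roughly the number of SNPs mentioned in the text


def neg_log10(p):
    """Height of a SNP on the plot: the negative log10 of its P-value."""
    return -math.log10(p)


def bonferroni_threshold(n_tests, alpha=0.05):
    """Assumed genome-wide cutoff: alpha shared equally across all tests."""
    return alpha / n_tests


def skyscrapers(p_values, n_tests):
    """Return the SNPs whose P-value falls below the assumed cutoff."""
    cutoff = bonferroni_threshold(n_tests)
    return sorted(name for name, p in p_values.items() if p < cutoff)


if __name__ == "__main__":
    for name, p in snp_p_values.items():
        print(f"{name}: -log10(P) = {neg_log10(p):.2f}")
    print("significant:", skyscrapers(snp_p_values, N_SNPS_TESTED))
```

Taking the negative logarithm is what turns a very small P-value into a very tall bar, which is why a strongly associated locus stands out as a tower above the rest of the skyline.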
FIGURE 5.3: Genome-Wide Association Study of Plavix is shown schematically via a “Manhattan plot,” which looks for skyscrapers, and in this case the cytochrome cluster, responsible for metabolizing Plavix, was the only common variant found. Source: A. Shuldiner, “Association of Cytochrome P450 2C19 Genotype with the Antiplatelet Effect and Clinical Efficacy of Clopidogrel Therapy,”
Journal of the American Medical Association 302 (2009): 849–58.
This finding has had clinical applications in at least two places, Scripps and Vanderbilt University. At both hospitals, patients who are to undergo stenting are screened for loss-of-function variants found in the CYP2C19 cluster. If such variants are identified, patients can be switched to other medications (which do not rely on functional copies of these genes for activation) or given higher doses of Plavix, to which some patients respond.
Several other examples of GWAS have demonstrated the key genes involved in a drug’s therapeutic action. Warfarin is a commonly used blood thinner, with over twenty million prescriptions in the United States per year.
46 The drug helps to prevent stroke in patients with artificial heart valves or atrial fibrillation and the formation of blood clots in patients who have experienced deep vein phlebitis and many other blood-clot disorders. A GWAS for warfarin effect has shown three major genes that are critical to the drug’s response. One is VKORC1, which encodes an enzyme that enables vitamin K to play its role in blood clotting (warfarin inhibits the action of the enzyme). The other two are cytochromes, CYP2C9 and CYP4F2, which are active in the metabolism of the drug in the liver.
47
Patient response to warfarin is remarkably variable. Some patients require only 1 mg per day, whereas others require 20 mg. Genotyping up front could help avoid inadvertent underdosing, and with it the possibility that a blood clot would form, and could also reduce the likelihood of bleeding from overdosing. GWAS have also identified gene variants that modulate the efficacy of other drugs, including metformin, the most commonly used drug to treat Type 2 diabetes, and methotrexate, which is used to treat cancer and autoimmune diseases. Studies investigating the efficacy of routine genotyping have so far produced mixed results, leaving the prospects for genotype-guided dosing unsettled.
GWAS of drugs for key side effects is the other side of the remarkable progress that has been made in this field. The Centers for Disease Control reports that almost 7 percent of hospitalizations in the United States each year are related to adverse drug reactions.
48 Take the treatment for hepatitis C: it can cause a hemolytic anemia in about 15 percent of patients. GWAS has revealed variants in the ITPA gene that substantially account for the risk or protection from this side effect.
49 Statins, which are used to treat high cholesterol and are the most commonly prescribed drug class in the world, have the primary side effect of muscle inflammation. Common variants in the gene SLCO1B1, which is involved in the liver uptake of statins, are critical—patients with two copies of the variant carry an over twentyfold higher risk of developing severe muscle inflammation.
50 GWAS revealed an allele influencing worrisome reactions to the antibiotic flucloxacillin (Floxapen), which can cause liver toxicity. A variant known as HLA-B*5701 carries an eightyfold risk of liver injury.
51 In the statin and antibiotic GWAS, only eighty-five and fifty-one cases, respectively, were needed to clinch the findings!
52
Likewise, GWAS revealed the underpinnings of bad reactions to carbamazepine (Tegretol), a drug used frequently for many neurologic conditions such as trigeminal neuralgia, epilepsy, diabetic neuropathy, and migraine headaches. The primary side effect of concern with this drug is a serious allergic reaction that can range from a skin rash to a life-threatening necrosis of skin throughout the body. In 2011, the risk allele for this side effect in individuals of European ancestry was discovered by GWAS with just twenty-three patients, and the HLA allele variant (HLA refers to the major histocompatibility complex harboring much of the immune system genomic apparatus) carried over a twentyfold greater risk for severe skin splitting, or dehiscence, known as toxic epidermal necrolysis. Routine genotype screening in Taiwan (testing a different HLA risk allele in individuals of Asian ancestry) for all patients getting Tegretol prescriptions has been shown to dramatically reduce the risk (to zero of more than 4,400 individuals treated).
53
Another drug, the powerful anti-inflammatory cyclo-oxygenase-2 (cox-2) inhibitor lumiracoxib (in the same drug class as Vioxx and Celebrex), which had been marketed as Prexige in many countries, had to be taken off the market because of rare but severe liver toxicity. A GWAS showed that an HLA gene variant (HLA-B*5701) carries a fivefold risk of liver injury with this drug.
54 Such insight could even pave the way to bring this drug back—a “rescue”—with genotyping performed in advance to screen out individuals with high risk of liver damage.
Despite these successes, most drugs have not had a GWAS investigation. Even so, more than 25 percent of commonly used prescription medications have some genetic information that can be useful for guidance.
55 Many drugs, several of them used to treat cancer, fall into this category: abacavir (Ziagen) (interacts with the same HLA allele as lumiracoxib, HLA-B*5701), 5-fluorouracil (Efudex), irinotecan (Camptosar), azathioprine (Imuran), and 6-mercaptopurine.
56 Other notable examples of drugs with strong genetic data (albeit not via GWAS) to guide selection or dosage include beta-blockers for heart failure; cisplatin, which can have the side effect of hearing loss in children; tamoxifen, which is used in treatment of breast cancer; metformin, which is prescribed for diabetes; and succinylcholine (Anectine), which is used to relax muscles during anesthesia.
57
The stark contrast between the success of GWAS in predicting drug reactions, and the relative lack of success in identifying disease susceptibility, seems to be a result of the actions of natural selection. Whereas there has been inexorable evolution of humankind in response to many diseases over hundreds of thousands of years, the individual’s exposure to drugs can be considered the “new, new thing.” There hasn’t been a chance for selection pressure in the genome for adaptation to drugs. Finding key gene variants that influence the response to a medication can be likened to hitting the side of a barn—an easy target. This dichotomy between disease-susceptibility genes and drug effects of genes, however noteworthy, should not imply that we’ll never get better at predicting disease susceptibility. After all, our knowledge of the variations in genes that drive both disease and side-effect susceptibility is far from complete. The obvious next step would be to go beyond peeking at genomes and look at every base, or at the very least every base of particular regions of the genome. Getting more granular, deep into the DNA sequence, will move the field forward.
PIVOTING TO SEQUENCING
We know now that the “common variant, common disease” theory does not largely explain heritability of complex traits and diseases, so genomics has turned to much rarer variants, digging down to the 0.1 percent and even lower allele frequency levels. Much work had indicated that such lower frequency variants would have a higher rate of penetrance. For example, for “good” high-density lipoprotein (HDL) levels in the blood, it has been shown that multiple rare variations in several genes, in aggregate, largely explain the relatively common trait of low HDL. Rare variants with high penetrance have been found in severe obesity, Type I diabetes, schizophrenia, and many autoimmune diseases. Of note, some of these variations are not in SNPs but instead are structural—either deletions or copy number variations (CNVs) (see
Figure 5.1, p. 81). This points to one of the other issues in missing heritability. While SNPs are the most common form of human genomic variation, indels, CNVs, and other structural variations are critical, too, and have not been adequately emphasized. SNPs in a GWAS can serve as a marker for some of the structural variations, particularly CNVs, but they are quite incomplete. The full map and disclosure of these structural variations has to rely on whole-genome sequencing of thousands of individuals with the condition (phenotype) of interest.
A good comprehensive sequencing investigation should be, like GWAS, hypothesis-free. There are two other hypothesis-free genomic approaches: exome sequencing and whole-genome sequencing.
EXOME SEQUENCING
Exome sequencing refers to sequencing the 1.5 percent of the genome that actually codes for proteins. This process enables us to detect whether there are functional variants—so called because they affect the structure and action of the encoded protein—that contribute to disease. Given its size, it is much easier to work with the exome than the whole genome in order to sort out the needles from the haystack; the former might have tens of thousands of variants, whereas the whole thing can have half a million or more. The trick is finding the one or ones that are functional. The best means of determining whether a genomic variation is functional is to breed mice with the variant (the so-called orthologous mutation, or the equivalent of the human genomic change in the mouse genome) and see whether the mice recapitulate the disease phenotype. But lesser forms of support or proof are becoming acceptable, such as the in silico computer-based prediction of whether a variant in a coding element would significantly change the protein structure or binding properties (e.g., by disrupting an enzyme’s catalytic site—essentially the business end of the stick). For the rest of the genome outside of the exons, this task is remarkably more complicated—we don’t yet have the tools to predict likely functionality of genomic regulatory changes.
The relatively low cost and quick completion of exome sequences, as compared to whole-genome sequences, has recently made the exome the hot, go-to method for cracking diseases. As a result, the previously unknown root causes of a variety of rare, Mendelian diseases were determined via exome sequencing. Several cancer types, such as ovarian clear cell and uveal melanomas, had key mutations determined via exome sequencing. The discovery efforts extended to unexplained mental retardation, severe brain malformations, and even adaptation to high altitude among Tibetans.
58 Thus, there was a remarkable body of progress in just the first year of exome sequencing, with genomic science groups all over the world in hot pursuit. A second phase of the genomics gold rush was officially under way.
The exome is not without shortcomings, however. This approach is not truly “hypothesis-free,” since for it to work it requires that important variations exist in the exome. To really solve the problem of inherited disease, the full monty would have to be pursued—whole-genome sequencing on a large scale was the inevitable next step.
WHOLE-GENOME SEQUENCING
As a colleague of mine and I wrote in 2007, “Ultimately, when whole genome sequencing is practical and affordable, it will be increasingly difficult for the genomic basis of health and disease to be left undetected.”
59 We aren’t at the practical and affordable ideal yet, but the progress that has been made in the past few years has far exceeded virtually anyone’s expectations.
The initial efforts in the 1970s were essentially manual, done by tagging bases with radioactive isotopes. Gel-based systems arrived in the 1980s and were automated by 1990; even at that point, however, they could only sequence fewer than 10,000 base pairs per day at a cost of $10 per base. By the mid-1990s capillary sequencing took root, and the method gradually increased the output from 15,000 base pairs per day to over one million, and the cost dropped to about $1 per base called.
60 What has put exome and whole-genome sequencing in reach of people like Nicholas Volker’s doctors are the “massively parallel sequencing” machines, which rely on simultaneously reading hundreds of thousands of short stretches of sequence. This has raised output from twenty million per day (on the 454 Life Sciences platform) in 2005 to twenty-five billion (Illumina HiSeq and Life Technologies SOLiD 4) in 2010. In just five years, the speed increased a thousandfold while the price per base fell from $0.01 to $0.000001.
61 The rate of improvement has left Moore’s law for computing in the dust.
62
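The comparison with Moore's law can be checked with a few lines of Python. The throughput and price figures are the ones quoted above; the two-year doubling period used for Moore's law is an assumption (a commonly cited rule of thumb), and the variable names are illustrative.

```python
def fold_change(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


# Figures quoted in the text.
bases_per_day_2005 = 20e6    # twenty million (454 Life Sciences)
bases_per_day_2010 = 25e9    # twenty-five billion (HiSeq, SOLiD 4)
cost_per_base_2005 = 0.01    # dollars
cost_per_base_2010 = 0.000001

YEARS = 5
ASSUMED_DOUBLING_YEARS = 2.0  # assumption: Moore's law as "doubling every two years"

throughput_gain = fold_change(bases_per_day_2005, bases_per_day_2010)
cost_drop = fold_change(cost_per_base_2010, cost_per_base_2005)
moore_gain = 2 ** (YEARS / ASSUMED_DOUBLING_YEARS)

if __name__ == "__main__":
    print(f"throughput gain over {YEARS} years: {throughput_gain:,.0f}x")
    print(f"cost per base fell by a factor of:  {cost_drop:,.0f}")
    print(f"Moore's-law gain over {YEARS} years:  {moore_gain:.1f}x")
```

Under that assumption, five years of Moore's law gives a gain of less than sixfold, against a throughput gain of roughly 1,250-fold and a price drop of about 10,000-fold for sequencing.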
Considering the required time and cost for some of the landmark projects helps to add perspective. The first human genome sequence, which actually represented a hodgepodge of multiple individuals, took thirteen years and $2.7 billion. In 2007, Craig Venter’s genome took four years and about $100 million. The Watson genome in 2008 took only four months and just $1.5 million.
63 By November 2008, multiple human genomes were sequenced in one to two weeks for less than $100,000. In 2009 Stephen Quake, a professor at Stanford, sequenced his own genome in a week for less than $50,000.
64 By 2010, one of the leading genome science companies, Illumina, offered whole-genome sequencing for $28,000, and the newcomer Complete Genomics announced that they would soon perform whole-genome sequencing for $5,000. By the end of 2009 they published in
Science multiple human genome sequences and declared “the high accuracy, affordable cost of $4,400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variant in large-scale genetic studies.”
65 By the end of 2011, Complete Genomics was sequencing approximately 1,000 whole human genomes per month.
This high throughput is amazing, but it is only part of what’s necessary to make whole-genome sequencing clinically important. The other side of the coin is accuracy, and indeed, that is what has kept the Archon X prize of $10 million—to be given to the first team to sequence a hundred human genomes in ten days—from being awarded.
66 The accuracy of sequencing platforms is dependent on the depth of coverage, depth being a measure of how many times an average base is read during sequencing. If the average is forty times, this is commonly regarded as a deep sequence, although you should bear in mind that, because it’s an average, some bases are going to be read a hundred times, and others only ten. At some point there appears to be a “saturation” point at which further depth does not improve accuracy to any measurable or substantial degree.
The other key metric about sequencing is the length of the read. This was not much of a problem in the capillary sequencing era, when the classic Applied Biosystems machine had read lengths approaching 1,000 bases. Massively parallel sequencing started with read lengths of less than 40 bases, and these days it gets up to a few hundred. Unfortunately, short read lengths are unhelpful and can even be misleading when looking for structural variations such as indels, CNVs, inversions, and the like. Remember that we are trying to pick up rare and very low frequency genomic variants, occurring at a frequency of 1 percent or far less in the population. If you are looking for a variant that occurs in only 0.1 percent of the population, and the accuracy of the sequencing is 99.9 percent, the 0.1 percent false reads of 3 billion base pairs will yield three million false positive variants. In such a case, the number of correctly identified rare variants would also be about three million, meaning that 50 percent of all positives would be false positives. Sorting these out would be a mess. Accordingly, everything that can be done to improve depth of coverage and read length, and promote as close to 100 percent accuracy as possible, is required to make sequencing and the downstream interpretation of the data practical. Despite the extraordinary reductions in cost, we aren’t quite there yet. But that doesn’t mean the competition to get there isn’t fierce.
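The false-positive arithmetic in this paragraph can be checked with a short Python sketch. It assumes a genome of 3 billion bases and treats every miscalled base as a false positive variant, which is a simplification; the function names are illustrative. Three million false calls across three billion bases corresponds to a per-base error rate of 0.1 percent.

```python
GENOME_SIZE = 3_000_000_000  # base pairs, as quoted in the text


def expected_false_calls(per_base_error_rate, genome_size=GENOME_SIZE):
    """Expected number of miscalled bases across one genome."""
    return per_base_error_rate * genome_size


def false_positive_share(false_calls, true_calls):
    """Fraction of all reported variants that are actually errors."""
    return false_calls / (false_calls + true_calls)


if __name__ == "__main__":
    true_variants = 3_000_000  # correctly identified variants, per the text
    false_calls = expected_false_calls(0.001)  # 0.1 percent error rate
    share = false_positive_share(false_calls, true_variants)
    print(f"false calls: {false_calls:,.0f}")
    print(f"share of positives that are false: {share:.0%}")
```

Because the error rate is multiplied by the whole genome, even a small improvement in per-base accuracy removes a large number of spurious variants from the list that has to be sorted through.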
The level of competition was exemplified by a scene I witnessed at an annual meeting about genomic sequencing held in February 2008 on Florida’s Marco Island. Every genomic sequencing company gets fired up for this meeting, and they hold nothing back to imprint themselves and their products on the memories of attendees, of which there are more than a thousand. A different sequencing company—Illumina, Life Technologies, Pacific Biosciences, or Helicos—sponsored each meal of the conference. Pacific Biosciences, a company that until that point had been operating in stealth mode, was treating the event as a “coming out” party, with an evening of fireworks and celebration on the beach. Even my hotel room key had the name and logo of one of the sequencing companies on it.
I participated in a panel, sponsored by Pacific Biosciences, on how much sequencing capacity would be needed to make the biggest advances in medical genomics. As usual, the emphasis was on more done in less time; at the meeting, the company’s representatives spoke about sequencing a whole human genome in less than fifteen minutes at some point in the future. (Three years later, however, they had not yet sequenced one whole human genome.) I think the panel was missing a fundamental issue, and it occurred to me while I was there. Everyone was saying the usual things about accuracy, coverage, read length, throughput, and cost. But what, I wondered, about the people and patients who would be sequenced? In order for us to make any substantive advance on missing heritability, we need to whole-genome sequence large numbers—thousands—of people with a particular condition and compare the findings with suitable controls without the condition. What’s more, we would need to make sure that the controls, to truly be controls, would never develop the condition. Otherwise, they wouldn’t really be controls.
The fact of the matter was that no one was talking about that kind of study. The 1000 Genomes Project, an international, government-funded collaboration, was just getting under way. It sequences randomly selected, anonymous subjects whose health status is unknown, predominantly at a low depth of coverage. There was no plan at that time, or even a few years later, to tackle the challenge I envisioned, with just a few exceptions, such as a large sequencing project of thousands of individuals with Type 2 diabetes. A big reason is expense: even if one could go with the Complete Genomics or Illumina price of a whole sequence for less than $5,000 by 2011, any such study would still incur $10 million or more in sequencing costs.
67
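A rough sense of where a figure of that size comes from can be had from a one-line calculation. The cohort sizes below are assumptions chosen for illustration (the text says only "thousands"), and the function name is hypothetical.

```python
def sequencing_cost(n_cases, n_controls, price_per_genome):
    """Raw sequencing bill for a case-control study, in dollars."""
    return (n_cases + n_controls) * price_per_genome


if __name__ == "__main__":
    # Assumed cohort: 1,000 people with the condition and 1,000 controls.
    total = sequencing_cost(n_cases=1_000, n_controls=1_000, price_per_genome=5_000)
    print(f"sequencing alone: ${total:,}")
```

This counts only the raw sequencing; it leaves out recruitment, phenotyping, and the interpretation costs discussed next.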
As Kevin Davies wrote in his 2010 book,
The $1000 Genome, the price of sequencing keeps falling.
68 But even if we were to reach a $1,000 genome, that would not capture all the costs. Having the sequence is not the whole story, and the rate-limiting step is the in-depth interpretation, which nowadays is estimated to cost several hundred thousand dollars. If we get thirty-times coverage of each person’s genome, we would have ninety billion bases to sort through. These, of course, would be in little slices, which would then need to be assembled and compared with the human reference genome and annotated with respect to all of the known functional variants. Here is where the “quants” (the new breed of math whizzes) take over: major analytics, computational biology, and informatics are used to crank out the meaningful findings, moving from raw sequence data to real information. The typical human genome sequence yields around three million sequence variants compared with the reference genome. Approximately 100,000 SNPs are deemed “novel” (first discovered in the individual being sequenced), and of these some 15,000 to 20,000 occur in exons. As more and more individuals are whole-genome sequenced, the lower frequency SNPs will be found and deposited in dbSNP, and the numbers of novel variants will progressively go down.
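The relationship between depth of coverage and the volume of raw data is simple multiplication, sketched below in Python. The 3-billion-base genome size is the round figure used in this chapter; the function names are illustrative assumptions.

```python
GENOME_SIZE = 3_000_000_000  # base pairs, the round figure used in the text


def total_bases(depth, genome_size=GENOME_SIZE):
    """Bases that must be read to cover the genome `depth` times on average."""
    return depth * genome_size


def mean_depth(bases_read, genome_size=GENOME_SIZE):
    """Average number of times each base was read."""
    return bases_read / genome_size


def variant_fraction(n_variants, genome_size=GENOME_SIZE):
    """Share of positions that differ from the reference genome."""
    return n_variants / genome_size


if __name__ == "__main__":
    print(f"30x coverage: {total_bases(30):,} bases")
    print(f"90 billion bases: {mean_depth(90_000_000_000):.0f}x mean depth")
    print(f"3 million variants: {variant_fraction(3_000_000):.1%} of the genome")
```

The same division shows why the three million variants are hard to sort: they amount to about one position in a thousand, scattered through ninety billion bases of short reads.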
Clinically annotating the human genome is another big step. Stephen Quake’s paper reporting the findings of his rapid self-sequencing adventure had three authors, and it reportedly took a week to obtain the raw data. But when his genome was annotated, it took thirty-one authors several hundred hours to scour through the genomic variants and all of the related literature to figure out which ones were clinically meaningful. His risks of diabetes, coronary artery disease, and obesity were noted, but of particular interest were the sixty-three predicted pharmacogenetic interactions, which included inability to metabolize Plavix, needing a low dose of warfarin, and unresponsiveness to beta-blockers and routine anti-diabetic drugs. Another finding was that he carried a rare variant for cystic fibrosis, of which he had previously been unaware. This might have been useful information in conjunction with similar data for his wife, since cystic fibrosis is a Mendelian recessive trait requiring two copies of the rare variant. (Also of note was that Stephen had a cousin who had died suddenly at age nineteen, cause unknown. In 2011, it was announced that the cousin’s DNA would be sequenced in the first “molecular autopsy.”)
69
So even though there is much we can learn from sequencing whole genomes, it has been—and remains—an expensive proposition. Hence its clinical applications have been limited. Thus far the strategy of whole-genome sequencing (WGS) has predominantly been applied to cancer. By sequencing paired tumor and germ-line DNA, the ability to find driver mutations has been demonstrated for a number of cancers. The first case was that of a leukemia patient, and a small subset of eight genes thought to drive the cancer was identified. To find them, the researchers sequenced ninety-eight billion base pairs from the leukemic cells (the genome being sequenced thirty-three times) and forty-two billion base pairs from germ-line (skin tissue was used, sequenced fourteen times).
70 Several solid tumors, including lung, breast, and pancreatic cancer, have since been sequenced.
71 In small-cell lung cancer, WGS identified signatures related to tobacco exposure.
72 WGS has been applied to a family of four individuals with a rare Mendelian trait known as Miller syndrome,
73 and it provided some incremental findings beyond exome sequencing, which had been previously applied for this same trait (a second gene mutation was identified). It has been used for finding the genetic defect in Charcot-Marie-Tooth in a particular family and applied to discordant identical twins with multiple sclerosis (one affected, one not).
74 In the latter study, surprisingly no explanation was found for this neurologic condition—yet another indicator of how complex the human genome has proven to be.
SEQUENCING TO SAVE A LIFE
The case of Nicholas Volker, touched on in the opening of this chapter, demonstrates the real potential of defining life codes. In medicine we often use the terms “idiopathic” or “cryptogenic” when we really mean “we don’t know.” Such was the case with this child, who would develop a fistulous connection from his intestine to the skin of his abdomen whenever he ate. This led to more than a hundred surgeries to repair the bizarre channels, a clinical condition that had not been seen before. Ultimately, when all hope was about to be given up, and Nicholas had been enduring protracted hospitalization, intermittently septic, and living in a hyperbaric chamber at times, his pediatrician at the Medical College of Wisconsin asked for Nicholas’s genome to be sequenced.
The results were completely unexpected—a mutation in the gene XIAP, known to be pivotal in the activity of the immune system, was responsible. Happily, there was a way to fix it—an umbilical cord blood transplant to replace the stem cells responsible for producing white blood cells, one of the cornerstones of our immune response. Such an operation hadn’t even been under consideration. This genomic knowledge and the decision it induced wound up transforming Nicholas from near death to a healthy five-year-old boy. Now at the Medical College of Wisconsin more than forty children in the queue are being considered for sequencing to define, at the molecular DNA level, why they are sick. A committee of physicians, geneticists, genetic counselors, and ethicists meets on a regular basis to consider whether a child qualifies for sequencing after conventional medical measures have been exhausted. And that same committee decides the priority of the patients, determining who will be sequenced next.
The Volker case was groundbreaking, and for the first time, it is possible to imagine a future medicine that has no use for the terms “idiopathic” or “cryptogenic.” Many people go from one medical center to another because their condition has not been diagnosed, effective therapy cannot be found, and they are suffering. The fact that the committee in Milwaukee was able to secure reimbursement for sequencing in select cases is particularly intriguing, since one can certainly make a case that sequencing early in life could prove to be cost-effective over the long term: it would preempt extensive and expensive medical evaluations, not to mention totally ineffective treatments.
In 2011, whole-genome sequencing of fraternal teenage twins who had a serious movement disorder (dystonia) led to the precise molecular diagnosis and highly effective therapy.
75 The case of these twins, whose parents were healthy, is particularly instructive and representative because their causative mutation was due to a “compound heterozygote,” meaning that mutations in two different bases in the same gene, one from the mother, one from the father, came together to induce the movement disorder. This is a common explanation for why a new condition appears without it ever having shown up in the prior generations, and it will certainly be a common finding in the sequencing era going forward.
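The idea of a compound heterozygote can be expressed as a simple check over a child's variants. The sketch below is a toy illustration in Python: the gene names, positions, and record layout are all hypothetical, and it assumes the parental origin of each variant is already known, which in practice requires sequencing the parents or other phasing work.

```python
from collections import defaultdict

# Each record: (gene, position, inherited_from). All values are hypothetical.
child_variants = [
    ("GENE_A", 1042, "mother"),
    ("GENE_A", 2310, "father"),
    ("GENE_B", 518, "mother"),
    ("GENE_C", 77, "father"),
    ("GENE_C", 901, "father"),
]


def compound_het_genes(variants):
    """Genes hit at two different positions, one maternal and one paternal."""
    by_gene = defaultdict(lambda: {"mother": set(), "father": set()})
    for gene, position, parent in variants:
        by_gene[gene][parent].add(position)

    hits = []
    for gene, origins in by_gene.items():
        maternal, paternal = origins["mother"], origins["father"]
        # Require at least one maternal and one paternal site that differ.
        if any(m != p for m in maternal for p in paternal):
            hits.append(gene)
    return sorted(hits)


if __name__ == "__main__":
    print(compound_het_genes(child_variants))
```

In this toy data only GENE_A qualifies: GENE_B has a single variant, and both GENE_C variants come from the same parent, so one working copy of that gene remains. A real analysis would also have to judge whether each variant actually damages the gene.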
A family in Ogden, Utah, had five children in two generations die from a mysterious accelerated aging condition that had never been seen before, but through sequencing this has now been pinpointed to a gene mutation on the X chromosome.
76 For this family, the Ogden Syndrome, as it is now being referred to, could be prevented by in vitro fertilization with selection of the embryo that does not carry this mutation.
The root cause of the rare Proteus syndrome, which is believed to be the basis of the disfigurement of Joseph Merrick, known as the Elephant Man, has now finally been identified as a gene mutation in AKT1, which is also a mutated gene in many cancers.
77
Although these cases are encouraging, we need to see many more like them before sequencing can be considered a new path to demystifying serious, unknown medical conditions. Currently, the ability to sequence is way out in front of our ability to interpret the data. Annotating human genomes to accurately process and interpret the enormous data derived from each individual’s DNA, separating the noise (changes in bases or structure that are irrelevant) from the signal (the culprit, functional variation), remains a formidable challenge. This will become less daunting when the genomes of hundreds of thousands of people, representing the full gamut of phenotypes, are sequenced and fully annotated.
THE RACE TO AMP UP SEQUENCING
The sequencing race is multidimensional. It is being waged across countries and continents, across the major companies and technology platforms, and throughout the academic genomic science community. The Beijing Genomics Institute headquartered in Shenzhen, China, has purchased more than 100 of the latest generation sequencing machines (predominantly HiSeq) and has projected completing 20,000 human genomes by the end of 2011.
78 North America and Europe, particularly the United States and the United Kingdom, have over 1,000 of the HiSeq or SOLiD sequencers and project about 10,000 human genomes will have been sequenced by the end of 2011. Complete Genomics, which like Pacific Biosciences became a publicly traded company in 2010 (as compared with Helicos, which was delisted in the same year), is currently sequencing thousands of human genomes each month and projects to be able to provide the life codes for a million people in five years. Their unique “mail order” WGS model is an interesting twist that preempts academic or life science industry laboratories from having to purchase the very expensive sequencers and proprietary reagents, along with the significant costs of specialized personnel. This model could make raw sequence determination a commodity, but time will tell if this is a financially viable model and whether the genome product will be widely accepted by the most accomplished academic sequencing centers.
Another new competitor on the scene in 2010 was Ion Torrent, which was acquired by Life Technologies and offers a smaller “desktop” sequencing platform that is not intended for WGS but might be especially applicable for medical sequencing, such as rapid sequencing of a bacterial pathogen (like MRSA, methicillin-resistant Staphylococcus aureus, which can be a notoriously deadly infection) to predict whether a particular antibiotic would be effective. Ion Torrent’s desktop platform was the first to be based on semiconductors, perhaps representing the most important shift of platform for sequencing. In 2011, the first human genome to be sequenced by transistors was that of Gordon Moore, one of the most prominent pioneers of semiconductors. Although it took about 1,000 Ion Torrent chips, the number of chips required to sequence a whole human genome is expected to drop exponentially in the next couple of years.
79
This represents the ultimate convergence between transistors and sequencers—DNA transistors. In December 2010,
Scientific American published their “World Changing Ideas” top-ten list; highlighted on the cover was the emergence of DNA transistors as “innovations for a brighter future.”
80 Unlike approaches that depend on costly reagents and the optical instruments needed to read fluorescent tags, this process is a radical redesign of genome sequencing: “Whereas existing dishwasher-size sequencers require expensive chemical reagents to analyze genes that have been sliced into thousands of small fragments, the so-called DNA transistor takes an almost naively simple approach.” Chris Toumazou, director of the Institute of Biomedical Engineering at Imperial College, London, is a principal advocate for using semiconductor technology to achieve human genome sequencing. He expects semiconductor technologies could enable the sequencing of a “full genome on a chip within a matter of minutes.”
81 An already existing prototype of a portable sequencer is the size of a cell phone. If this results in a successful, functional technology, it would take the sequencing and medical genomics world by storm.
Because of all of these platforms, it has been projected that 250,000 human genomes will be fully sequenced by the end of 2012, 1 million by 2013, and 5 million by 2014. Clifford Reid, the CEO of Complete Genomics, projected in 2010 that within five years his company alone would sequence 1 million human genomes. So in the next five years, there ought to be more than enough WGS to digitize the life codes for most human conditions.
82
PROGRESS IN CANCER GENOMICS
Cancer is a genomic disease. It can’t develop without a change in the DNA sequence of the cancer cell genome. While the germ-line DNA may have susceptibility genes that predispose an individual to cancer, somatic cells have developed mutations that have overridden the normal DNA housekeeping repair processes. There is a remarkable spectrum of changes that can occur in the genomes of cancer tissue, ranging from point mutations to the structural variations I have been reviewing in this chapter (insertions, deletions, CNVs). One structural variant is particularly prominent and, in many examples, fundamental: chromosomal rearrangements.
83 These rearrangements involve broken DNA strands that have been rejoined somewhere else in the genome, either on the same chromosome (intrachromosomal) or on another chromosome (interchromosomal).
Often these chromosomal rearrangements can have an activating role, whereby a “fusion gene” is created, combining two previously separate genes into a hybrid that in some cases markedly promotes growth and replication, fulfilling the “oncogene” functional category: a gene capable of inducing cancer (“onco” denotes tumor or cancer). Whereas fusion genes were first found in the “liquid” tumors (leukemias and lymphomas), they are now being recognized as occurring frequently in solid tumors, such as prostate and some types of lung cancer (adenocarcinoma). Besides the rearrangements, other types of genomic alterations in cancer include new DNA integration, particularly from a virus, which can cause certain cancers; mutations in the roughly 17,000 bases of the mitochondrial DNA; and epigenomic changes in the side chains of DNA and in the histones that package the DNA (discussed later in this chapter).
84
Not to be discounted, environmental exposures such as tobacco or ultraviolet light can induce somatic mutations through many of these mechanisms and are an important part of the story. In parallel to the diverse spectrum in the kinds of mutations possible, there is a wide range in the numbers of mutations. A cancer might involve no rearrangements, or it might involve hundreds; likewise, there can be fewer than 1,000 point mutations or more than 100,000. In an individual cancer, it is thought there might be as many as 20 “driver” mutations responsible for taking the cells off track.
85
Driver mutations are the functional force behind cancer and can confer resistance associated with recurrence of tumor growth. Out of 22,000 protein-coding genes, about 350 to 400 have recurrently shown up with somatic mutations driving the growth of a tumor. They come disproportionately from specific gene families, and one of the most important is the protein kinase family.
Determining the importance of many of these protein kinase genes (such as the epidermal growth factor receptor gene EGFR, KRAS, and BRAF in particular tumors) has led to some striking advances in cancer therapy. Traditional chemotherapy relies on multiple toxic drugs (with or without radiation) that kill cells indiscriminately; the new therapies are much more specific. One of the most dramatic examples is the specific point mutation (V600E) in the gene BRAF, a driver mutation that is found in about 60 percent of patients with malignant melanoma. Despite multiple chemotherapies and radiation, nearly all patients with malignant melanoma die within one year of diagnosis. But now with an orally active BRAF mutation-directed drug that specifically binds the mutated protein, more than 80 percent of patients respond with rapid tumor shrinkage in only two weeks. The highly specific nature of this benefit is emphasized by the observation that patients not carrying the BRAF mutation actually get worse with the drug.
86 This is a prototypic example of individualized treatment of a particularly lethal form of cancer.
There are many other mutations with biologically based drug coupling that have led to enhanced efficacy of treatment. Two well-known drugs are Gleevec for chronic myelogenous leukemia, which targets a fusion gene, and Herceptin for breast cancer, which targets the HER2 receptor, but there are many more recent examples. The drug Erbitux (cetuximab), which was tied to the major media exposé involving Martha Stewart and the CEO of ImClone, Sam Waksal, is commonly used in colon cancers. However, it is not effective if a patient carries a KRAS mutation. For a subtype of lung cancer known as non–small-cell, the drug Iressa (gefitinib) has proven to be remarkably useful when there are activating mutations in EGFR. In the same type of lung cancer, patients who have tumors lacking a mutation in the gene ERCC1 have a marked benefit with cisplatinum chemotherapy. The drug Sorafenib inhibits the cancer gene RAF and has been shown to improve outcomes in patients with cancers involving this driver mutation, particularly individuals with kidney, liver, lung, and thyroid cancer. For tumors that have an ALK gene fusion, which occurs in certain lymphomas and non–small-cell lung cancer, the experimental drug Crizotinib has been demonstrated to have a high response rate. In certain types of brain cancer, such as glioblastoma, the drug Temodar (temozolomide) is effective when the MGMT gene is methylated. The mutated gene PIK3CA (phosphoinositide 3-kinase, PI3K) has been shown to be present in a variety of tumors, such as ovarian, colon, certain brain tumors, and others, and a number of drugs have been designed to target this cancer gene and are in the midst of clinical testing.
87
With the ongoing efforts in cancer mutation screening and sequencing, the discovery of targeted cancer therapies is clearly accelerating. We are learning that the same driver mutation can be found in a diverse set of cancers, such as BRAF in not only melanoma but also colon cancer and thyroid cancer. In one of the first documented cases of sequencing a tumor and tailoring its therapy, Dr. Marco Marra of the University of British Columbia in Vancouver had a patient in his eighties with tongue cancer that had metastasized to his lung.
88 Sequencing the tumor revealed a driver mutation in the cancer gene RET (abbreviation for “rearranged during transfection”), which led to a successful outcome using a targeted drug that would not otherwise have been considered. So connecting the dots between a mutation and a drug may prove to be more useful than guiding therapy by the type of cancer (e.g., colon or lung), which turns out to be too imprecise.
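To make the idea of matching therapy to a mutation rather than to a tissue concrete, here is a minimal sketch in Python. It is a toy lookup keyed by gene and alteration, and its entries simply restate examples given in this chapter; the names `MARKER_NOTES` and `notes_for_tumor` are hypothetical, and nothing here is a clinical tool.

```python
# Illustrative only: each entry paraphrases an example from this chapter.
# The table is keyed by (gene, alteration), not by the tissue of origin.
MARKER_NOTES = {
    ("KRAS", "mutation"): "Erbitux (cetuximab) is not effective",
    ("EGFR", "activating mutation"): "Iressa (gefitinib) has proven useful",
    ("ALK", "fusion"): "Crizotinib has shown a high response rate",
    ("MGMT", "methylated"): "Temodar (temozolomide) is effective",
}


def notes_for_tumor(findings):
    """Return the chapter's drug notes for each (gene, alteration) found.

    Findings that are not in the table are skipped rather than guessed at.
    """
    return [MARKER_NOTES[f] for f in findings if f in MARKER_NOTES]


if __name__ == "__main__":
    tumor = [("ALK", "fusion"), ("KRAS", "mutation")]
    for note in notes_for_tumor(tumor):
        print(note)
```

The lookup never asks whether the tumor arose in the colon or the lung, which is the point of the paragraph above: the same alteration can carry the same implication across different cancers.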
The controversial codiscoverer of the DNA double helix, James Watson, an octogenarian, recently stated that cancer can be cured in his lifetime—at least with some qualifications. He said, “I would define cancer cured as instead of only 100,000 being saved by what we do today, only 100,000 people die. We shift the balance.” With sequencing, he believes we’ll know all of the genetic causes of major cancers in the next few years.
89
Although the cancer genome can be remarkably chaotic and is much more complex to accurately sequence and analyze than germ-line DNA, it is clear that sequencing tumor DNA could be the first step in treating a cancer patient. Many have projected that digitizing the cancer in any affected individual will become commonplace in the future. Despite sharing the same issues of practicality and affordability as WGS, cancer sequencing holds considerable promise. In the next few years, thousands of cancers will be sequenced, and it will ultimately be determined whether a large panel of driver mutations is a useful tool as compared with exome or whole-genome sequencing.
BEYOND SEQUENCE: RNA, PROTEIN, METABOLITES, EPIGENOMICS
DNA sequence has been the main topic of this chapter and the field in general, but the scientific efforts and progress go well beyond it and touch every aspect of what we collectively term the “omics.” Included in this general term are transcriptomics (the RNA transcripts from genes or noncoding RNAs), proteomics (all the proteins that are translated), metabolomics (all the metabolites), and epigenomics.
International scientific efforts are mapping each of these. One example is the Human Protein Atlas, led by researchers in Sweden, India, China, and South Korea, who are defining the human proteome. This, by itself, is a herculean task, since even though there are only about 22,000 genes, there is not a 1:1 relationship of genes to proteins. Gene transcripts undergo alternative splicing, and cells can change proteins after they are produced (so-called posttranslational modifications), so proteins can appear in any given individual in many forms; it is estimated that the human proteome comprises more than one million distinct proteins.
90
Sequencing the transcriptome, the set of all RNA molecules, provides a digital readout of what the genome is doing (its activity) at a given moment in a particular tissue, and it has gained popularity over its predecessor, gene expression profiling. Gene expression profiling is used clinically for prognosis in breast cancer and for monitoring rejection after transplantation, and it has recently been introduced as a way to diagnose the presence of coronary artery disease.
91 (A study led by our team at Scripps on gene expression in coronary artery disease was recognized as one of the top ten medical breakthroughs by Time in 2010.)
92 In each of these examples, large studies were performed using cancer tissue or, in the case of transplantation and coronary disease, using white blood cells to determine the relevant set of gene transcripts that would track with the clinical condition of interest. Whole-transcriptome sequencing, known as RNA-Seq, has already proven especially helpful for identifying fusion genes in cancer.
Metabolomics is the study of all of the metabolites or cellular end products. It relies heavily on the technique of mass spectrometry, which is used to both identify and quantify the unique chemical fingerprint that represents a particular metabolite. In conjunction with the other omic tools, such as a gene mutation in a coding region, or an abnormal protein, this can determine whether a novel metabolite or pattern is associated with a disease or condition of interest.
The field of epigenomics has really exploded in recent years.
93 The term “epigenetic” has evolved to encompass heritability not related to DNA sequence. There are many examples of how environmental conditions can influence heritability through the generations: people who smoke cigarettes in their youth affect the age of puberty in their children’s children; mothers exposed to famine have specific imprinting on the side chains of the insulin-like growth factor 2 gene (IGF2) six decades later.
94 This transgenerational impact of environmental conditions helps to make sense of and coalesce the long-standing nature versus nurture debate.
Accordingly, there is no such thing as “identical” twins. Even twins emanating from the same embryo (monozygotic), with the same DNA sequence, have different epigenomic marks, chiefly influenced by differences in the maternal in utero or perinatal environment. These marks can be manifest on the elements that package the DNA in a number of ways (Figure 5.4): by attaching methyl groups to the side chains of genes (particularly at C or cytosine bases), by changing the chromatin pattern (open or closed), or by affecting the histones (attaching an acetyl or methyl group).
Along with multiple international “big science” programs, a large U.S. government-funded collaborative initiative known as the Roadmap Epigenome Project is dedicated to unraveling the epigenome’s complexity. In October 2009, Joseph Ecker and his colleagues from the Salk Institute in La Jolla published the first human methylome,
95 using both skin and stem cells. Their achievement was recognized by Time as a top ten scientific discovery for that year.
96 While epigenomic modifications that often result in turning off or “silencing” effects on gene expression were first emphasized in cancer, more recent studies have highlighted their contribution in many other diseases, including mild cognitive impairment and Alzheimer’s, behavioral disorders such as schizophrenia, and metabolic diseases of obesity and diabetes.
97 Many studies of Type 2 diabetes, especially, have highlighted epigenomic effects. These include the role of open chromatin for the principal susceptibility gene TCF7L2; the importance of parent-of-origin sequence variants in passing down risk alleles (whereby, for example, only the SNP of paternal or maternal origin is critical for transmitting susceptibility); and methylation of key genes in muscle biopsies of diabetics.
98 In experiments with rats, fathers with chronic high-fat diets have daughters who develop diabetes; the process is thought to be mediated by a change in the methylation pattern of several genes.
99 The Economist, which doesn’t miss much in the meaningful progress of genomics, had an article on the origin of diabetes. “Don’t blame your genes,” the headline said. “They may simply be getting bad instructions—from you.” The illustration was a picture of a woman eating an ice cream sundae, with the caption “piling on the methyl groups.”
100
FIGURE 5.4: Schematic of DNA to show methylation of the cytosine side chains, the histone and chromatin packing of DNA, and other features, including noncoding RNAs.
There is one more “ome” that deserves more than a mention as we wrap up the omic landscape: the microbiome. This refers to the species of bacteria, fungi, and viruses that live in each of us. There are efforts throughout the world to map the microbiome: the MetaGUT project in China, the Human Gastric Microbiome in Singapore, the Human MetaGenome Consortium in Japan, MicroObes in France, the Australian Urogenital Microbiome Consortium, the Canadian Microbiome Initiative, and the U.S. Human Microbiome Project.
101
In our intestine alone bacterial species contain more than a hundred times as many genes as our own human genome. In fact, there are about a hundred trillion microbes in our gut, as compared with ten trillion cells in the human body. Many microbiomes have been linked to diseases—such as the gut microbiome and obesity or heart disease, the airway microbiome and asthma
102—and this whole area is just beginning to take off because of the power of current ultra-high-throughput sequencing platforms. A recent editorial in the Journal of Clinical Investigation, “Which Species Are in Your Feces,” pointed out the remarkable value of high-throughput sequencing of the intestinal microbiome to determine antibiotic-resistant, high-risk populations.
103 Some researchers have even resorted to fecal content transplants from healthy patients to those who harbor an extremely difficult bacterium to treat, Clostridium difficile, as a potential probiotic therapy.
104 In 2011, a major advance in understanding the human gut microbiome was made with the discovery of distinct “enterotypes” from hundreds of individuals. There were three different patterns of microbiomes of the gut, characterized by dominance or overrepresentation of a specific bacterial genus: Bacteroides, Prevotella, or Ruminococcus.
105 The specific enterotype may influence the risk of colon cancer, obesity, and metabolic disorders; the response to many medications; what we should eat; and the optimum antibiotic to use in case of an infection. It’s a big new wrinkle in the new science of individuality.
PERSONAL “CONSUMER” GENOMICS
We now progress to a central topic—how will knowledge of my genome prevent diseases and keep me healthy? When are the data ready for prime time?
The field of personal genomics has been mired in controversy, which got particularly loud in late 2007 when two companies, DeCode Genetics and 23andMe, commercialized genome-wide scans for the public, followed soon thereafter by Navigenics. These companies offered the same genotyping chips that were being used in GWAS, assaying 500,000 SNPs, and they provided a readout of risk for complex traits and diseases. The kits were available to order over the Internet and the data were provided over the Web. The high cost attached to the service (initially ranging from $995 for DeCode and 23andMe to $2,500 for Navigenics) included updates on a frequent basis as new data became available. Since GWAS results were being published in major scientific journals virtually every week, this was a key feature. The expensive Navigenics package included a telephone session with a genetics counselor to review and interpret the results.
In late 2007, I was one of the first subjects to test the Navigenics service, before it was commercially released. The security involved was extraordinary. Once I received an email notification that my results were ready, I had to use multiple passwords and verify my identity using my cell phone to finally get into the site and get my results. Since I’m a cardiologist with no family history of heart attacks, the first thing that caught my eye was this: “Heart Attack—Average Lifetime Risk 52 percent, Your Risk 101 percent.” I was pretty stunned, not least because I had no idea how my risk of heart attack could be greater than 100 percent! I remember calling my wife that evening and suggesting that I might not make it home for dinner. I later notified Navigenics, and they confirmed that it was a mistake. It was a good thing they were not yet marketing the test to general consumers!
Figure 5.5 shows my updated Navigenics output for disease susceptibility for twenty-five conditions. At the top of each column is the lifetime risk category, ranging from less than 1 percent on the far left to more than 50 percent on the far right. The darkened boxes represent conditions for which my risk exceeds that of the general population. I have increased risk of osteoarthritis, at 56 percent in my lifetime compared with 40 percent in the general population. In fact, I already have this problem, so that was accurately forecasted, but doesn’t everyone get osteoarthritis if they live long enough? Going back to the heart attack risk, it looks a lot better now than the first pass, but my risk is 30 percent, or 12 percentage points, higher than that of the general population of men. Notably, I have more than a threefold increased risk of developing Alzheimer’s disease because I carry one copy of the well-documented apoε4 risk allele. If I had two copies of this allele, my risk of developing Alzheimer’s would be increased at least tenfold. I am also at heightened risk for brain aneurysm (although the absolute risk for me is still less than 1 percent), abdominal aorta aneurysm (less than 4 percent), and atrial fibrillation (risk of 33 percent, compared with 26 percent for the general population).
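The two ways of expressing elevated risk used above, percentage points and relative increase, are easy to conflate. Here is a minimal sketch in Python that uses only lifetime risk figures quoted in the paragraph. The function name `compare_risk` is hypothetical, and this is not how Navigenics computed its reports. It also rejects any value outside 0 to 100 percent, the kind of sanity check that would flag a reported risk of 101 percent.

```python
def compare_risk(my_risk_pct, population_risk_pct):
    """Compare an individual's lifetime risk with the population's.

    Both inputs are percentages. Returns a tuple of
    (difference in percentage points, relative risk rounded to 2 places).
    """
    for value in (my_risk_pct, population_risk_pct):
        # A lifetime risk is a probability, so it must lie in 0..100 percent.
        if not 0 <= value <= 100:
            raise ValueError(f"risk must be between 0 and 100 percent, got {value}")
    if population_risk_pct == 0:
        raise ValueError("relative risk is undefined when population risk is 0")

    absolute_points = my_risk_pct - population_risk_pct
    relative = round(my_risk_pct / population_risk_pct, 2)
    return absolute_points, relative


if __name__ == "__main__":
    # Osteoarthritis: 56 percent versus 40 percent in the general population.
    print(compare_risk(56, 40))
    # Atrial fibrillation: 33 percent versus 26 percent.
    print(compare_risk(33, 26))
```

For osteoarthritis the gap is 16 percentage points, which is a relative increase of 40 percent; the same underlying numbers sound quite different depending on which one is reported.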
It’s not immediately clear what the clinical importance of any of this information is. Take colon cancer. Even though both my maternal grandparents and my paternal aunt died of colon cancer, and I have been undergoing colonoscopies every five years since age thirty-five, my report indicates that my risk of colon cancer was not increased compared with the average (5 percent versus 6 percent). Does that mean I am not at risk of colon cancer and can stop having colonoscopies every five years? I wish, but unfortunately, the answer is a definite no. The scan only tests for common SNP variants, and, as we have seen, the heritability explained by common variants is grossly incomplete. One cannot make accurate individual predictive risk assessments based on the results of large populations: association doesn’t equate with prediction. That is, just because I don’t have the zip codes associated with colon cancer doesn’t mean I will not develop this condition. On the other hand, without question I have a risk of getting Alzheimer’s disease, but it isn’t clear there is anything I can do about prevention. So what can I do with this information?
FIGURE 5.5: My Navigenics genome-wide scan results. Each column pertains to the risk level of the disease in the population, ranging from less than 1 percent to greater than 50 percent. My risk for each of the twenty-five conditions is compared to the general population.

It turns out I can do many things. If I had a doubling or more risk of melanoma, I would certainly be much more concerned about sun exposure. Knowing I was at high risk of diabetes would give me extra incentive to maintain my weight and exercise maximally. For blood clots in the leg (deep-vein thrombosis), I would be much more inclined to get up and walk around on long flights and consider taking low-dose aspirin as a preventive strategy. Knowing I have a moderately increased risk of heart disease and atrial fibrillation, if I get chest discomfort with exertion or develop a very rapid heart rate, I’ve been forewarned that I am predisposed to these conditions. Instead of denying that the chest discomfort might represent lack of blood supply to my heart muscle, I would seek medical attention. Similarly, if I developed a fast and irregular heart rate, I would know what this likely signals and get the appropriate therapy.
Here is another account from a participant in our Scripps Genomic Health Initiative:
As an employee of Sempra, I participated in this study in May and received my results last week. I was amazed at the informative detail and thoroughness of the report. I was “floored” when I read my Navigenics results which found that my highest elevated risk condition (96 to 98 percent) was colon cancer.
The reason I found this so fascinating is that I have just been diagnosed with colon cancer. This was found during a routine colonoscopy two weeks ago (I am 51 years old and was supposed to have it done last year). Had I procrastinated and not had the colonoscopy procedure done this month, I would have definitely made an appointment to have the procedure done after reading my results. I am a believer in this study! I have shared my results with other family members and intend to share this report and other risk conditions with my doctor. Thank you for allowing me to participate in this important health study.
Jeff Gulcher, the chief scientific officer of DeCode Genetics, had a genome-wide scan (known as DeCode Me) performed by his company and found he had a doubled risk of prostate cancer. Although he was not yet fifty, when PSA screening might be considered, he had the screening done and learned that his PSA level was significantly elevated. Ultimately, he had a prostate biopsy, which showed cancer, and underwent a radical prostatectomy, which he believed saved his life.
106 Here’s another anecdote, which a Stanford professor emailed to me after a lecture I gave on genomics at medical grand rounds in late 2008 (I have withheld his name for privacy reasons):
Subject: My personal genomics story
I decided to volunteer for the Scripps Genomic Health Initiative after hearing Eric Topol give medical grand rounds at Stanford. My analysis was mostly reassuring, but showed two areas of increased risk. One was for prostate cancer, which was not a surprise since my father died of prostate cancer. The second was celiac disease (due to HLA-DQ2.5 and other markers), which was a surprise. Although in retrospect, I had some subtle signs and symptoms which could be attributed to celiac disease: poor digestion of fatty foods, low serum cholesterol, LDL, and HDL, a mysterious skin rash (which might be dermatitis herpetiformis) and recurrent aphthous ulcers. So I followed-up the Navigenics report with a serologic test for celiac disease, which was positive, and an upper endoscopy, which was also positive, for moderately severe celiac disease. A dexa bone density study showed that my bone density was around 1.5 standard deviations below the mean for sex and age-matched controls, which is probably due to chronic vitamin D deficiency from celiac disease. A gastroenterologist has recommended a gluten-free diet, which is thought to be effective at reducing the many possible complications of celiac disease: diabetes, thyroid disease, liver disease, small bowel lymphoma, other cancer. In addition, I am taking calcium and Vitamin D, and my first-degree relatives are also being tested for celiac disease. It is amazing to me, at the age of 52 years, and being a physician, that my diagnosis and treatment was possible only because of your DNA test. Thank you for that.
Opinion about whether such genetic risk information should be available to anyone who seeks it has been split. Among much of the medical establishment, however, the backlash was strong. Nature Genetics fired the first shot, entitled “Risky Business,” in December 2007,
107 which was a quick response to the first commercial release, and by January 2008, the New England Journal of Medicine began a relentless series of editorials and perspectives, the first of which was entitled “Letting the Genome Out of the Bottle: Will We Get Our Wish?” coauthored by the journal’s editor, Jeffrey Drazen.
108 Dr. Terry Manolio, who worked directly for many years for Francis Collins (the leader of the public Human Genome Project) at the NIH, concluded her review of GWAS with this statement: “Patients inquiring about genome wide association testing should be advised that at present the results of such testing have no value in predicting risk and are not clinically directive.”
109
On the other hand, the journal Nature took a more measured outlook, arguing that scientists needed to figure out how to talk about personal genomics with the public.
110 By November 2008, Nature dedicated a whole issue to personal genomics; its cover featured a human genome inside a container labeled “Break Glass When Ready” and “Your Life in Your Hands: Instructions for the Personal Genome Age.” One of its articles, “Misdirected Precaution,” pointed out that personal genome tests are “blurring the boundary between experts and lay people.” And its authors stated, “We welcome a shift from genetic protectionism to a situation in which individuals become experts on, and active governors of, their genomes.”
111
Collins has been one of the few advocates in the medical establishment for giving the public access to genomics. In a 2008 segment of “The DNA Age” series on personal genomics by Amy Harmon, discussing the slow uptake of these tests, Collins, then director of the National Human Genome Research Institute and now director of the NIH, said, “It’s pretty clear that the public is afraid of taking advantage of genetic testing,” and “if that continues, the future of medicine that we would all like to see happen stands the chance of being dead on arrival.”
112 In his book The Language of Life, Collins writes about how learning his own genome scan results changed his behavior. His risk of Type 2 diabetes was high, he said, so he lost twenty pounds and got into a regular exercise routine with a personal trainer, which might otherwise not have happened.
113
Technologists have taken an approach more like that of Collins than that of Manolio. Sergey Brin, the cofounder of Google, is married to Anne Wojcicki, who started 23andMe. Sergey’s mother developed Parkinson’s, as did his great aunt. Sergey had a genome-wide scan and found that he carried the variant in the LRRK2 (leucine-rich repeat kinase) gene, which carried a high risk—about 70 percent—of developing Parkinson’s disease. And his mother, who also had the 23andMe genome scan, had the same variant. Sergey started a blog, and his first post was titled “LRRK2.” He wrote: “I know early in life something I am substantially predisposed to. I now have the opportunity to adjust my life to reduce those odds (e.g. there is evidence that exercise may be protective against Parkinson’s). I also have the opportunity to perform and support research into this disease long before it may affect me. And, regardless of my own health, it can help my family members as well as others. . . . I feel fortunate to be in this position.”
114 His perspective is reminiscent of the view of Charles Sabine, the former NBC correspondent who learned he had the gene for Huntington’s disease and said, “Knowing my genetic condition both empowers and motivates me. . . . Nothing incentivizes more than knowing your genetic code.”
115
Other journalists have been positive as well. Amy Harmon, who later won a Pulitzer Prize for her coverage, wrote a front-page article for the New York Times in November 2007: “Learning My Genome, Learning About Myself.” Thomas Goetz, executive editor of Wired, had a cover article, “Your Life: Decoded. A New $1,000 DNA Test Can Tell You How You’ll Live—and Die. Welcome to the Age of the Genome.”
116 Time named retail genome scanning the top invention of 2008.
117 Goetz also wrote a feature in Wired about Sergey Brin, which reviewed Brin’s experience with personal genomics. Goetz argued that having access to DNA results can make some think that “it holds dark, implacable secrets” or “toxic knowledge.” This wasn’t the first time the medical establishment had thought such a thing, however. He reminded us that “in 1961, 90 percent of physicians wouldn’t tell their patients if they had cancer.”
118
That coverage, though, accounts for most of the few bright spots in the history of personal genomics thus far, and it’s worth recalling that genome-wide scans have been under attack. Nevertheless, this form of direct-to-consumer (DTC) genetic testing uses research-grade genotyping chips and should be differentiated from “snake oil” DTC genetic tests that have no scientific basis, such as analyzing one’s genes for what foods one should eat or avoid, finding out whether you are “compatible” with your lover, or predicting whether a child will become a world-class athlete. But there are many legitimate concerns that I separate into five general categories: (1) fear; (2) interpretability, utility, and errors; (3) privacy and security; (4) lack of regulation; and (5) access and cost.
To address some of these issues for which there were no data or only limited anecdotal data, the Scripps Genomic Health Initiative performed a large study of more than 3,600 individuals, who underwent the Navigenics genome-wide scan at a markedly discounted charge ($200 instead of $2,500). The study assessed the impact on diet and exercise, psychological effects, and medical screening and diagnoses. In follow-up at approximately six months, for more than 2,000 participants who submitted their data, there was no evidence, as assessed in well-validated tests of psychological impact, of depression, distress, or heightened anxiety. Unhappily, there was also no clear evidence of lifestyle improvement or of undergoing appropriate screening tests related to conditions of risk, but there was strong support for intent to have such tests. For individuals with an increased risk of colon cancer there was a significant increase in intent to undergo colonoscopy, as well as a similar increase in intent to undergo mammography for breast cancer risk, PSA testing for prostate cancer risk, and eye examination for those at heightened risk for macular degeneration.
119
The colonoscopy finding was interesting. My wife, who participated in the study, found she had a doubled risk of colon cancer. But at age fifty-five, five years past when she should have undergone her first colonoscopy, she had previously avoided the procedure. She underwent it and was relieved to find out it was normal. At least two participants in the study similarly had their first colonoscopy and had polyps removed; one also was found to have cancer and felt gratified that it was detected early. Of note, more than half of the population over age fifty does not undergo colonoscopy as recommended. But this alternative to mass screening, directed by increased risk of common gene variants, may someday prove to be a worthwhile means of directed, individualized screening, particularly when more of the heritability for each condition can be precisely defined.
The overall findings of the study were both sobering and somewhat reassuring. Changing behavior to improve lifestyle (to lose weight, exercise more, and eat healthy foods) is one of the greatest challenges in health care, and there have not been any major success strategies to date. Unlike with Collins and Brin, results of a genome-wide scan did not induce salutary lifestyle changes in a large population.
120 On the other hand, the “toxic knowledge” concern could be put aside. There was solid evidence pointing away from any detrimental psychological impact. This goes along with an important study on the apoε4 allele in families with Alzheimer’s disease, which concluded, “The disclosure of apoε genotyping results to adult children of patients with Alzheimer’s disease did not result in significant short-term psychological risks.”
121 Our study, which included apoε genotyping along with over twenty other conditions, yielded similar findings. That doesn’t mean that everyone can deal equally well with his or her DNA results, but it strongly suggests that willing participants are well equipped to do so. It certainly counters the assertion of the officials of California’s Department of Public Health that these tests were “scaring a lot of people to death.”
122
The questions about utility are more grounded. From an article by my colleagues in Nature in 2009, it became clear that the results across the three companies could be conflicting.
123 The main reason had nothing to do with the genotyping, which was remarkably accurate, but with the lack of consensus on which studies to use and how the calculations were made for reporting the risk of the various conditions. There were no standards for what level of evidence was necessary to incorporate a genetic marker or what magnitude to assign its effect. An individual who has tests from all three companies might have an increased risk for a disease in one and protection from the same disease in another. I have had all three of these genome scans, along with a fourth test by Pathway Genomics, and have seen some of these discrepancies myself. By 2010, there were demands resonating from both government regulators and the science community for developing uniform standards for reporting.
124
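To see how accurate genotyping can still produce conflicting reports, consider a minimal sketch. Everything in it is hypothetical: the marker names, the odds ratios, and the simple multiplicative model are assumptions chosen for illustration, not the method any of the companies actually used. The point is only that two panels drawing on different studies can read the same genotype in opposite directions.

```python
from math import prod

# Hypothetical genotype: number of risk alleles (0, 1, or 2) carried at each marker.
genotype = {"marker_1": 2, "marker_2": 1, "marker_3": 2}

# Hypothetical per-allele odds ratios; each company includes a different set of markers.
company_a_panel = {"marker_1": 1.30, "marker_2": 1.20}
company_b_panel = {"marker_1": 1.30, "marker_3": 0.70}


def relative_risk(genotype, panel):
    """Multiply per-allele odds ratios over the markers a panel includes.

    This multiplicative model is an illustrative assumption, not a
    description of any company's actual calculation.
    """
    return prod(
        odds_ratio ** genotype.get(marker, 0)
        for marker, odds_ratio in panel.items()
    )


risk_a = relative_risk(genotype, company_a_panel)
risk_b = relative_risk(genotype, company_b_panel)

print(f"Company A relative risk: {risk_a:.2f}")
print(f"Company B relative risk: {risk_b:.2f}")
```

With these made-up numbers, one panel reports a risk above the population average and the other reports a risk below it, even though the underlying genotype is identical.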
The biggest vulnerability of present-day genome-wide scans cannot be avoided: as GWAS peeks at the genome, most of the risk will be missed. Hopes are being pinned on whole-genome sequencing to be the means of providing the critical missing information.
Some individuals have had whole-genome sequencing and written about the experience, but the revelations have been far from striking. Craig Venter in 2008 published “A Life Decoded: My Genome, My Life,” which contained tidbits of his genomic sequence data, such as an elevated risk of heart attack from a gene variant that results in slow metabolism of caffeine.
125
In 2009, Steven Pinker, a highly regarded Harvard psychologist, published “My Genome, My Self” in the New York Times Magazine in which he pointed out, from exome sequencing, that he had a genetic predisposition to baldness but couldn’t have more hair on his head, and already knew he had one copy of a rare variant for a serious disease known as familial autonomic dysautonomia. Pinker’s reflection on his apoε allele was noteworthy. He knew that James Watson, in his own scan, had omitted that sequence data when published in 2008 in Science. As Pinker put it: “All of us already live with the knowledge that we have the fatal genetic condition called mortality, and most of us cope using some combination of denial, resignation and religion. Still, I figured that my current burden of existential dread is just about right, so I followed Watson’s lead and asked for a line-item veto of my APOε gene information.” As an aside, Pinker is an advisor to Counsyl, a company with a genomic test for screening couples who are planning to conceive a baby or have a history of infertility or multiple miscarriages. Counsyl screens one hundred rare recessive mutations for carrier state for $350. Pinker, with the rare mutation for familial autonomic dysautonomia, had his wife screened; she carried this allele too. He said, “We met too late in life to have children, but if we had met a few years earlier we would have been playing roulette.”
126 More recently, with next-generation sequencing it has become technically feasible to screen more than 500 recessive conditions (out of a known 1,139 recessive Mendelian traits), which has the potential to supersede Counsyl’s more limited platform for screening couples who are planning to conceive a child.
127
Esther Dyson, the venture capitalist and one of the first ten participants in the Personal Genome Project, wrote a Wall Street Journal op-ed entitled “Full Disclosure” in which she explained why she’s posting her genome and medical records on the Internet. She quoted a colleague: “You would no more take a drug without knowing the relevant data from your genome than you would get a blood transfusion without knowing your blood type.”
128 Misha Angrist, a genetics researcher at Duke and former genetic counselor, published a book in 2010, Here Is a Human Being, about his experience in participating in the same Personal Genome Project as Pinker and Dyson and getting his whole-genome sequence data.
129 While long on the adventure and checkered with humor, it was notably short on meaningful genomic data. Glenn Close presented some of her sequence results at the Society for Neuroscience’s annual meeting in San Diego in late 2010. Her family, which she vaguely described as a “neuroscience family,” has a history of mental illness, and she had other family members contribute to the presentation.
130 Even though she was the first identified woman (rather than an anonymous one) whose diploid genome was sequenced, there were no specific gene findings to note. Even the notorious Ozzy Osbourne had his genome sequenced; the data were presented at the TEDMED 2010 meeting. Surprisingly, it had no insight to offer on his long history of substance abuse.
131 While these are just a sampling of individuals in the nascent WGS phase, the results show us that a lot more work needs to be done to make the six billion life codes medically fruitful.
Here I’ll return to the controversy over genome-wide scans to offer some predictions of what it will be like when WGS is widely available. Even though participants in the Personal Genome Project were required to have their sequence data posted online, the issue of privacy is a central concern. Some of the worries about employer or health insurer misuse of the results were diminished by passage of the Genetic Information Nondiscrimination Act (GINA) in 2008.
132 The law is already seeing action in court: The first lawsuit under GINA was filed in 2010 by a woman who sued her employer for terminating her job after she had confirmed very high risk of breast cancer with the BRCA2 mutation and underwent a double mastectomy. The law has clear limits, however. Insurers can still request genetic information before covering procedures, and while GINA litigation aims to prevent abuse of genetic data by health insurers or employers, it does not protect against inappropriate handling by life insurers or long-term disability insurers and does not apply to military coverage or to companies with fewer than fifty employees.
133 These residual issues may and should be addressed in the future, but legitimate concerns about privacy of life code data remain.
Further concerns about privacy are indexed to the potential of misuse of the data by the consumer genomics companies. For example, what if a consumer genomics company marketed the data to pharmaceutical companies? What happens to the data if the consumer genomics company goes bankrupt, which is not a far-fetched scenario? A de-identified, anonymized genome-wide scan can still be used to identify a particular person. So there should be protection for privacy in any database that contains genome-wide scan data, let alone WGS data. Nevertheless, personal genomic companies have performed high quality research with their database and been able to directly connect to a large number of individuals with a particular condition, such as through the discovery of novel gene variants for Parkinson’s disease by 23andMe.
134
Government regulation of consumer genomics companies has been a centerpiece (and the semblance of a circus) in their short history. Back in 2008, the states of California and New York sent “cease and desist” letters to the genome scan companies.
135 State officials were concerned that the laboratories that generated the results were not CLIA (Clinical Laboratory Improvement Amendments) certified and that the tests were being performed without a physician’s order. All three companies developed workaround plans in California and remained operational but were unable to market the tests in New York.
136
In 2010, the regulation issues escalated to the federal level. In May it was announced that 7,500 Walgreens drugstores throughout the United States would soon sell Pathway Genomics’s saliva kit for disease susceptibility and pharmacogenomics.
137 While the tests produced by all four companies had been widely available via the Internet for three years, the announcement of wide-scale availability in drugstores (which was cancelled by Walgreens within two days) appeared to “cross the line” and set off a cascade of investigations and hearings by the FDA, the Government Accountability Office (GAO), and the Congressional House Committee on Energy and Commerce. The FDA’s Alberto Gutierrez said, “We don’t think physicians are going to be able to interpret the results,” and “genetic tests are medical devices and must be regulated.”
138 The GAO undertook a “sting” operation with its staff posing as consumers who bought genetic tests and detailed significant inconsistencies, misleading test results, and deceptive marketing practices in its report.
139
All four personal genomics companies are struggling. In the cumulative four years since the first commercial launch, about 100,000 consumers have bought the tests.
140 The prices have dropped dramatically, to between $200 and $400, and the scope of the scans has narrowed to approximately 50,000 to 100,000 SNPs (down from 500,000 to 1 million SNPs) to save costs. The price is clearly a significant factor for consumers. During our Scripps Genomic Health Initiative study, the Sempra energy company in San Diego offered to defray the costs for participation for 1,000 of its employees. There was a veritable stampede of Sempra staff to put their saliva in cups! This correlates to the response I get when I ask groups of lecture attendees, “How many would get a genome scan if it was free?” More than 90 percent typically raise their hands.
Although these companies have been under siege and may not survive, several important impacts have forever changed the landscape of genomics for consumers. One is the realization that there will likely never be a “right time”—after we have passed some imaginary tipping point giving us critical, highly actionable, and perfectly accurate information—for it to be available to the public. The logical conclusion is that the tests should be made available. What’s more, the fact that they have been available has meant that the democratization of DNA is real.
141 Consumers now realize that they have the right to obtain data on their DNA. As a blogger wrote in response to the Walgreens flap, “To say that this information has to be routed through your doctor is a little like the Middle Ages, when only priests were allowed (or able) to read the Bible. Gutenberg came along with the printing press even though few people were able to read. This triggered a literacy/literature spiral that had incredible benefits for civilization, even if it reduced the power of the priestly class.”
142
The American Medical Association (AMA) sees things differently. In a pointed letter to the FDA in 2011, the AMA wrote: “We urge the Panel . . . that genetic testing, except under the most limited circumstances, should be carried out under the personal supervision of a qualified health professional.”
143 The FDA has indicated it is likely to accept the AMA recommendations, which will clearly limit consumer direct access to their DNA information. But this arrangement ultimately appears untenable, and eventually there will need to be full democratization of DNA for medicine to be transformed. Of course, health professionals can be consulted as needed, but it is the individual who should have the decision authority and capacity to drive the process.
The physician and entrepreneur Hugh Rienhoff, who has spent years attempting to decipher his daughter’s unexplained cardiovascular genetic defect and formed the online community MyDaughtersDNA.org, had this to say: “Doctors are not going to drive genetics into clinical practice. It’s going to be consumers.... The user interface, whether software or whatever, will be embraced first by consumers, so it has to be pitched at that level, and that’s about the level doctors are at. Cardiologists do not know dog shit about genetics.”
144
As a cardiologist I would tend to agree with Hugh’s point here, although I would not express it in these graphic terms. As I discuss in Chapter 9, physicians are not prepared for genomic medicine. In my 2007 Wall Street Journal op-ed, “What You Can Learn from a Gene Scan,” shortly after consumer genomics became a reality, I forecasted that “when a consumer arrives in his or her doctor’s office to get help in interpreting the genomic data, the doctor is likely to respond, ‘What’s a SNP?’”
145 Four years later that hasn’t changed, and we have only 1,500 medical geneticists and fewer than 2,000 certified genetic counselors in the United States for 310 million people.
146
The necessary and appropriate democratization of DNA data extends beyond what the companies willingly supply. Each individual has the right to request his entire data set, particularly worthwhile when scans include 500,000 to 1 million SNPs. One can look up any SNP with data on the SNPedia website or run it through George Church’s downloadable software, known as Trait-o-matic, to yield considerable information on conditions that are not generally reported. In 2011, the iPad app Genome Wowser became available to graphically review the findings of any individual’s genomic variants. Being able to explore one’s genome by manipulating the tablet screen is a particularly dynamic experience.
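For readers who do request their full data set, looking up a single SNP takes only a few lines of code. The sketch below is an illustration under stated assumptions: it supposes the export is a tab-separated text file with four columns (SNP identifier, chromosome, position, genotype) and comment lines beginning with "#". The actual layout varies by company, and the identifiers and genotypes shown are invented placeholders, not real variants.

```python
import csv
import io

# Invented sample standing in for a raw-data export; a real file would be
# opened with open(path) instead. The four-column layout is an assumption.
SAMPLE_EXPORT = """# rsid\tchromosome\tposition\tgenotype
rs0000001\t1\t1000\tAG
rs0000002\t2\t2000\tCC
rs0000003\t7\t3000\t--
"""


def load_genotypes(handle):
    """Read an assumed four-column, tab-separated export into a dictionary keyed by SNP ID."""
    genotypes = {}
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        rsid, chromosome, position, genotype = row
        genotypes[rsid] = {
            "chromosome": chromosome,
            "position": int(position),
            "genotype": genotype,
        }
    return genotypes


genotypes = load_genotypes(io.StringIO(SAMPLE_EXPORT))

# Look up one (hypothetical) SNP; .get returns None if the scan did not cover it.
print(genotypes.get("rs0000001"))
```

The identifier found this way is what one would then search for in a public resource such as SNPedia.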
One other contribution of the consumer genomics companies may be the “sweet spot”—pharmacogenomics. Remarkably, the data for all four companies were completely concordant across many drugs—Plavix, warfarin, statins, and many more. Pathway Genomics initiated a pharmacogenomics panel for $79 as part of the drugstore roll out, and all four companies have now incorporated such genotyping. Having had all four tests, I learned that I have a marked sensitivity to warfarin and a very high risk of developing bone-marrow suppression from the anti-cancer drug Irinotecan (Camptosar). I also learned that I am a poor metabolizer of caffeine; my risk of heart attacks would rise if I drank more than three to four cups of coffee per day.147 This information might be considered a bargain, since any single drug genotyping costs over $200 when processed by national labs like LabCorp or Quest Diagnostics. When I ask large groups, “How many would like to have their pharmacogenomics panel?” everyone raises his or her hand. Someday we will all have this information; it will be fairly comprehensive and markedly improve the precision of safety and efficacy for prescription medications.
That someday will likely involve whole-genome sequencing, given its inexorable march to becoming affordable and the software that will inevitably be developed to provide exceptional annotation, with constant updating. The first decade since the human genome sequence was drafted will, in retrospect, be viewed as a long warm-up to making a difference in day-to-day medical practice. Already we can see the impact that sequencing can make in cancer treatment, definition of individual gene-drug interactions, and unraveling idiopathic conditions. The foundation for genomic medicine has been laid. The revolution is ongoing: even though it has taken longer than initially projected, we are moving irrevocably forward in the second postsequence decade. Routine molecular biologic digitization of humankind is just around the corner.