Chapter Three

From Primordial Soup to Personalized Medicine

The building blocks of life on Earth began in the hot, reactive stew of hydrogen, carbon, oxygen and nitrogen that defined our planet roughly three and a half billion years ago. That was a mere ten billion years after the universe was formed by the Big Bang, and one billion years after our planet had formed. It is there that the rules governing genetics, and ultimately personalized medicine, first took shape.

Yet it was only during the last one hundred years that scientists concluded that DNA was the source of genetic inheritance.

Genes and DNA

Think of genes as library books, stacked on shelves (the chromosomes), in a library (the nucleus) of a city (the cell). All free-living organisms contain genes, from bacteria to the simplest algae and plants, right up the vegetable and animal kingdoms to us, Homo sapiens, at the self-proclaimed top of the evolutionary tree. Even viruses, which are not really alive and can only reproduce and manufacture their proteins inside a host cell, require at least a few genes composed of either DNA or RNA.

So what exactly is DNA, deoxyribonucleic acid? It’s a polymer, which is a long chain of connected units like the teeth of a zipper, or a string of beads on a necklace. DNA was discovered by Friedrich Miescher, a Swiss physician, in the 1860s. He obtained cells from the pus on discarded surgical bandages, and isolated their nuclei and the material within. He called this nuclear material nuclein; the name nucleic acid came later. Each unit of DNA contains a sugar molecule called deoxyribose. Deoxyribose is built around a ring of five atoms, four carbon atoms and one oxygen, and each sugar binds one of the four bases: adenine (A), guanine (G), cytosine (C) or thymine (T). Each [sugar + base] unit is called a nucleoside. The nucleosides are joined to each other by a link, a phosphate group, forming long threads of hundreds of thousands of subunits. These subunits, [sugar + base + phosphate link], are called nucleotides.

RNA is essentially the same as DNA except that it has another oxygen atom included in each ribose sugar molecule. So a DNA sugar unit is actually an RNA sugar unit without one oxygen, which is why it has the name “deoxy-ribose.”

Miescher thought nucleic acid could only have a simple and boring function, since DNA molecules are composed of long strings of nucleotides that are almost identical apart from having one of four possible bases attached. He thought that it might serve as a kind of scaffolding that just sat in the nucleus and supported more “interesting” molecules such as proteins. Proteins in turn are made up of strings of amino acids, of which there are around twenty different varieties.

Nothing could be further from the truth!

On each ribose, or deoxyribose, unit another chemical is added. In the case of DNA this is one of the four bases: adenine (A), guanine (G), cytosine (C) or thymine (T). When James Watson and Francis Crick elucidated the double helix structure of DNA in 1953, they found that these four bases “partner up.”

The Unknown Heroine Behind Watson and Crick

Rosalind Franklin (1920 – 1958) was a British biophysicist and graduate of Cambridge University who used X-rays to define crystal structures while working at the Medical Research Council Biophysics Unit at King’s College, London. It was her data on the structure of DNA, and in particular Photograph 51, that Watson and Crick relied on, without her knowledge and without giving her much recognition, to work out the structure of DNA. Her articles, which outlined the structure of DNA based on her own work, were eventually published in the same issue of Nature as Watson and Crick’s better known paper. Franklin, Crick, Watson and Maurice Wilkins were all keen to beat Linus Pauling to the actual structure of DNA after Pauling had proposed an incorrect one.

When the four bases “partner up”, they are bound, weakly, by hydrogen bonds: A binds to T and vice versa; C binds to G, and again vice versa. Adenine and guanine are examples of purine bases, and cytosine and thymine are pyrimidines (Figure 3.1a). In DNA, a long strand of the linked deoxyribose sugars bearing one of the four bases on each sugar molecule, now called deoxyribonucleic acid, is “zipped up” by pairing with a complementary chain of nucleotides (Figure 3.1b) on the other DNA strand. These long paired chains of complementary nucleotides (i.e. double strands) are then wound into a helix (Figure 3.1c), folded and balled up, to make up genes, which in turn are linked together on chromosomes.
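The pairing rule described above (A with T, C with G) can be written as a short Python sketch. The example sequence and the function name are made up for illustration, and the sketch ignores the fact that the two strands of real DNA run in opposite directions.

```python
# Pairing rules as stated in the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand: str) -> str:
    """Return the partner base for each base, position by position."""
    return "".join(PAIRS[base] for base in strand)


example = "ATGCCGTA"  # made-up sequence, not a real gene
partner = complementary_strand(example)
print(example)
print(partner)  # TACGGCAT
```

Applying the function to the partner strand gives back the original sequence, which is what makes the "zipper" copyable in either direction.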


Figure 3.1a. How the four bases pair up by hydrogen bonding. The pyrimidine base thymine (T) partners with the purine base adenine (A), and the pyrimidine base cytosine (C) partners with the purine base guanine (G). These bases are attached to the deoxyribose sugar rings of DNA or the ribose sugar rings of RNA. The sugars are joined together by phosphate linkages to form phosphate-deoxyribose or phosphate-ribose backbones accordingly (Figure 3.1b). Each deoxyribose plus base plus phosphate unit is called a nucleotide. The nucleotides are joined together into very long sequences, often over a million units long. In the case of DNA these then line up against their complementary partners to form a double strand, and are then wound into the characteristic double helix formation (Figure 3.1c).


Figure 3.1b. The phosphate-deoxyribose backbone of DNA.


Figure 3.1c. The characteristic double strand formation of DNA, wound into a double helix, which is then further folded and twisted.

DNA is a great way to store and transmit information, in this case genetic information. But more than that, scientists have even been able to store in DNA the kind of information that you or I would think more likely to belong on a computer. In 2012, a team led by Professor George Church at Harvard reported in the journal Science that they had stored a 53,000-word book, including eleven images and a computer program (5.27 megabits of information), in DNA. The scientists went on to suggest that as the technology improves we might see DNA replacing conventional digital storage. One gram of DNA, about the weight of a pinch of salt, can store 455 billion gigabytes of information. That is roughly the same amount of information that today would be stored on one hundred billion DVDs!
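The storage figures quoted above can be checked with a little arithmetic. This is only a rough sketch: the 4.7 gigabyte capacity of a single-layer DVD is an assumption that does not come from the text, and the variable names are invented for illustration.

```python
# Figures quoted in the text.
GIGABYTES_PER_GRAM = 455e9   # 455 billion gigabytes per gram of DNA
BOOK_MEGABITS = 5.27         # size of the stored book

# Assumption (not from the text): a single-layer DVD holds about 4.7 GB.
DVD_CAPACITY_GB = 4.7

dvds = GIGABYTES_PER_GRAM / DVD_CAPACITY_GB
book_bytes = BOOK_MEGABITS * 1e6 / 8  # 8 bits per byte

print(f"{dvds:.2e} DVDs per gram")         # about 9.7e10, close to one hundred billion
print(f"{book_bytes:,.0f} bytes in the book")
```

Under that assumption the result comes out just under one hundred billion DVDs, in line with the round figure given above.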

There are advantages to storing information in DNA, as all species on this planet have found. It is easily copied and passed on to the next generation. It can still be read thousands of years later, as we are now doing with the DNA from Egyptian mummies. And the way to read DNA does not change, unlike the rapidly changing digital storage media that make today’s information technology breakthroughs tomorrow’s relics for the museum or recycling.

Chromosomes

We humans have 46 chromosomes, consisting of 22 pairs and two sex chromosomes (XX in the case of women and XY in the case of men) (Figure 3.2). On these chromosomes, we have approximately 25,000 genes in total.


Figure 3.2. The 22 pairs and two sex chromosomes (in this case from a man – with the characteristic, short Y sex chromosome) of a human.

Most of our genes regulate the production of a protein of some description, with particular genes being more active in some parts of the body than in others. Think of it as the information, the blueprint, in the books (genes) being taken out of the library and sent to a house or factory (the ribosome). The ribosomes are small organelles within the same city, the cell, where this information can then be read, understood and used to produce something – in this case a protein. The ribosome is a protein-making machine in the outer part of the cell.

Each and every nucleated cell in the body has the same library, with the same 25,000 books, the genes. But different types of cells will be programmed to produce different proteins by taking the information from different books in the library. The book, in this case the gene, does not leave the library, the nucleus, but the information, the blueprint it contains, is conveyed to the ribosome factory in the cell cytoplasm by a messenger, a copy of the information. That copy is messenger RNA, mRNA.

RNA

Messenger RNA is cleverly made. When the double strand of DNA in the nucleus unzips, new sequences of complementary material bind to the exposed single strand of DNA and are then peeled off (transcribed) as precursor messenger RNA (pre-mRNA) (Figure 3.3). In the process, RNA incorporates the pyrimidine base Uracil (U) instead of the Thymine that is found in DNA. The formed RNA units [ribose+base+phosphate linkage] are also called nucleotides.
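The copying step can be sketched in a few lines of Python. The sketch assumes the usual pairing of each exposed DNA letter with its RNA partner (A with U, T with A, C with G, G with C); the template sequence and the function name are invented for illustration, and real transcription involves far more machinery than a simple letter-by-letter lookup.

```python
# RNA partner for each base on the exposed DNA strand.
# Note that RNA uses uracil (U) where DNA would use thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template: str) -> str:
    """Build the complementary RNA strand for a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)


template = "TACGGCAT"  # made-up DNA template
print(transcribe(template))  # AUGCCGUA
```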

However, there is much information within these DNA strands that is currently considered unnecessary. Think of the “book” (the gene) as being composed of many chapters. In the case of the largest gene in the body, that for dystrophin, the gene consists of about 2,400,000 nucleotide units. These nucleotides are split up into “sections” (chapters) of exons and introns. For dystrophin, there are 79 exons (accounting for 0.6% of the gene’s sequence) and 78 introns (accounting for the remaining 99.4% of the 2.4 million letters of the gene). Only the exons are needed for the information they contain to be translated into the dystrophin protein by the ribosome. Most of dystrophin’s exons are relatively small, at between 23 and 269 nucleotides in length, compared with the dystrophin introns, which average over 26,000 nucleotides in length.

All the information required by the muscle cell for manufacturing the final protein, dystrophin, is contained in the 79 exons – just 0.6% of the dystrophin gene.


Figure 3.3. Simple diagram of transcription initiation. RNA polymerase is an enzyme that unzips the double strand of DNA, and moves down the gene (black coding thread).

Figure 3.3 (continued). As the double strand “unzips”, the polymerase assembles complementary RNA units into a strand that gradually lengthens as the gene unzips ahead of it and zips back up behind it.

Figure 3.3 (continued). Eventually, the gene has generated a full complementary RNA strand and transcription terminates. The new pre-messenger RNA is completed and released. The DNA double strand “zips” back up, and the RNA polymerase is free to work on another gene.

The introns, by and large, carry information that is not needed by the factory in the cell cytoplasm, so they can be removed as the pre-mRNA is tidied up and converted into useful mRNA. We now know that this process of “tidying up”, which is called splicing, is a complex process, performed in the nucleus, and may in fact lead to different, “alternative”, blueprints being sent to the different cellular factories.

The Encyclopedia of DNA Elements (ENCODE) consortium of 442 researchers, launched by the U.S. National Human Genome Research Institute in 2003, has been examining the roughly 99% of our DNA that does not code for protein, which includes the introns. In September 2012, the scientists published 30 papers simultaneously in three major scientific journals, a remarkable feat on its own. Those papers suggest that many common diseases such as diabetes, obesity, heart disease, cancer and Alzheimer’s dementia, the ones that are not caused by identified single genes, may be caused by genes being switched on or off by the information contained in this non-coding DNA. But the story is only just beginning to be unraveled. We have a long way to go to fully understand what 99% of our DNA does, and how. A whole new branch of medicine and range of possible new drugs is now emerging that will target these switches, such as the one that may trigger a liver cell to stop absorbing sugar so that diabetes develops, or the one which, once thrown in a lung cell, will turn it into a cancer cell.

It used to be thought that one gene codes for one protein, but we now know that there are roughly 25,000 genes but approximately 150,000 proteins. Thus each gene on average contains the blueprint for six proteins. How?

The process of tidying up can sometimes lead to the message (in mRNA) being put together in different ways. Imagine that the book in the library is taken off the shelf and the cover and binding are removed, the non-essential information (the introns) is discarded (more about that later) and then a copy of the necessary information is bound into a much, much slimmer book. Thus the book copy (mRNA) that will leave the library (nucleus) only contains important instructions so that when it reaches the factory (the ribosome in the cell cytoplasm), the various blueprints are still recognized and can still be used to build the finished products (the different proteins). The introns that have been discarded contained no useful information at all, or so it was thought, but all of the exons do contain useful information. However sometimes one exon, or more, is missing from the final mRNA that makes it to the ribosome and some of the useful instruction is missing. A different final protein product can still sometimes be made and may work for the cell, and the body.

If the “book copy” (mRNA) was the blueprint for a car (protein), then consider that there would be chapters (exons) on how many wheels it had as well as a chapter on how many seats it had. It would have a chapter on the interior finish, another on the size of the engine, what color it should be, where and how the alarm is set up, etc. All of these chapters would be necessary to produce the car as sold in the showroom functioning to the manufacturer’s specifications. However, if the chapter on the alarm was missing, the car would still work; it just wouldn’t have an alarm. Or the car might be a different color, or lack one or more coats of paint. The car might rust quicker, but it would still be drivable.

Thus 25,000 genes (books in the library), may each generate many different molecules of mRNA (copies of thinned down books consisting of important information - exons). On average each gene in our bodies generates six different mRNA molecules that leave the library (nucleus) and go to the factory (ribosome), where they will generate six times the number of proteins compared to the number of original genes. This process, first described in 1977, is called alternative splicing.

Alternative splicing

We now know that about 95% of our genes that are composed of multiple exons are alternatively spliced. This happens across all eukaryote species. Eukaryotes are organisms, such as plants, animals and fungi, whose cells have a nucleus and other discrete structures, as opposed to prokaryotes, such as bacteria, whose cells do not have these discrete structures. The current record for most mRNA variants being generated from a single gene comes from the fruit fly, Drosophila melanogaster. The humble fruit fly has a gene, Dscam, which generates more than 38,000 splice variants.

The fruit fly has long been used to study genetics. In this interesting insect, alternative splicing is required to produce offspring of each sex. Pre-mRNAs from the D. melanogaster gene dsx contain 6 exons. In males, exons 1, 2, 3, 5, and 6 (skipping exon 4) are joined to form the mRNA, which codes for a regulatory protein required for male development. In females, exons 1, 2, 3, and 4 are joined (leaving out exons 5 and 6). The resulting mRNA codes for a regulatory protein required for female development. (Also see Figure 4.1).
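The two dsx splicing patterns described above can be laid out as a small Python sketch. The exon labels are placeholders standing in for the real sequences, and the names used here are invented purely for illustration.

```python
# Placeholder labels standing in for the six dsx exons (not real sequences).
EXONS = {number: f"[exon{number}]" for number in range(1, 7)}

# Which exons are joined in each sex, as described in the text.
SPLICE_PATTERNS = {
    "male": [1, 2, 3, 5, 6],   # exon 4 is skipped
    "female": [1, 2, 3, 4],    # exons 5 and 6 are left out
}


def splice(pattern):
    """Join the chosen exons, in order, into one mRNA 'message'."""
    return "".join(EXONS[number] for number in pattern)


for sex, pattern in SPLICE_PATTERNS.items():
    print(sex, splice(pattern))
```

One gene, with the same six chapters, yields two different messages depending only on which chapters are bound into the copy.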

Let us continue the analogy of the book (or blueprint) ending up at the factory (ribosome) out in the cytoplasm of the cell (the city). The book is composed of words. Each word is composed of three letters, and the words all make sense. Each letter is a nucleotide and each word (of three nucleotides) is a codon. Codons are bundled together – sometimes only a few, other times thousands – to form chapters in the book. Each chapter is an exon. The blueprint makes sense if the words in a chapter make sense. However, sometimes the book is damaged and a chapter, several words, or even a single letter in a word, is missing. The process of tidying up the book into a manageable copy “splices” the chapters of useful information (exons) together and removes the non-essential information (introns), by arranging every word (codon) with three letters (nucleotides) into sentences. The mRNA strand (the copy of the blueprint) can then be translated by the factory (the ribosome) into a protein.

However, if a single letter is missing, a letter is taken from the next word to complete the previous word when the message is read. The factory can only read words in sequence. Each word has to be of three letters for the sentence (and hence the chapter, and thus the book) to make sense. If the words are put together incorrectly, the book cannot be “read”, and the factory (ribosome) will not be able to make the protein.

Consider this example:

the big red fox ran far and saw the dog and cat hit the man

This information can be read and understood, but if the “g” in “big” is missed, then the “r” from “red” is used to complete the codon, a frame shift occurs, and the sentence no longer makes any sense:

the bir edf oxr anf ara nds awt hed oga ndc ath itt hem an

In practice, the factory will start off reading the message. It will be able to follow the instructions for “the,” but none of the other codons will make sense. So the rest of the sentence will not be interpreted, or worse, may generate bad proteins. This situation occurs in some, perhaps many, rare genetically determined diseases. It is estimated that 70% of neurological diseases are associated with some form of upset in the regulation of alternative splicing.
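The frame shift in the fox sentence above can be reproduced with a short Python sketch. The function names are invented for illustration; the sketch simply strips the spaces, deletes one letter, and regroups the letters into words of three, much as a reader that can only take three letters at a time would.

```python
def to_codons(text: str, size: int = 3):
    """Drop the spaces and regroup the letters into 'words' of three."""
    letters = text.replace(" ", "")
    return [letters[i:i + size] for i in range(0, len(letters), size)]


def delete_letter(text: str, index: int) -> str:
    """Remove the letter at the given position (counting from 0, ignoring spaces)."""
    letters = text.replace(" ", "")
    return letters[:index] + letters[index + 1:]


sentence = "the big red fox ran far and saw the dog and cat hit the man"
print(" ".join(to_codons(sentence)))

shifted = delete_letter(sentence, 5)  # position 5 is the "g" in "big"
print(" ".join(to_codons(shifted)))
# the bir edf oxr anf ara nds awt hed oga ndc ath itt hem an
```

Only the first word survives intact; every word after the missing letter is shifted by one position.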

There are 64 possible codons of three letters generated by the four bases A, C, G and T. Any one of the four bases can be placed in any of the three positions in the codon, thus the number of possible options is 4 x 4 x 4, which equals 64.
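The count of 64 can be confirmed by simply listing every possible three-letter word. This is a minimal sketch using Python's standard itertools module; the variable names are illustrative.

```python
from itertools import product

BASES = "ACGT"

# Every way of placing one of four bases in each of three positions.
codons = ["".join(letters) for letters in product(BASES, repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```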

Proteins are composed of chains of amino acids. There are 21 amino acids found in human proteins, of which twenty are coded for directly by one or more of the 64 possible codons. That is why it is vital for the three letters in a codon to be faithfully reproduced. If a different sequence of three letters is made, the codon may code for no amino acid at all, or for the wrong one. This may turn the protein from a “good” protein into a “bad” protein as in the case of sickle cell anemia, where a single nucleotide mutation leads to the hydrophobic (water hating) amino acid valine replacing the hydrophilic (water loving) amino acid glutamic acid on one of the hemoglobin protein chains.
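A single-letter change of this kind can be sketched as follows. The specific codons are not given in the text above: the sketch assumes the commonly cited change from GAG to GUG in the messenger RNA, and it includes only the two entries of the standard genetic code that are needed. The function name is invented for illustration.

```python
# Two entries from the standard genetic code (mRNA codons); the full table has 64.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",  # hydrophilic
    "GUG": "valine",         # hydrophobic
}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Swap the base at one position (0, 1 or 2) of a codon."""
    return codon[:position] + new_base + codon[position + 1:]


normal = "GAG"                       # assumed normal codon
mutant = substitute(normal, 1, "U")  # change the middle letter from A to U

print(normal, "->", CODON_TO_AMINO_ACID[normal])
print(mutant, "->", CODON_TO_AMINO_ACID[mutant])
```

One letter out of three changes, and the word now spells a different amino acid.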

Sickle Cell Anemia

Sickle cell anemia is prevalent in the tropics and especially in sub-Saharan Africa, where in some regions roughly one third of indigenous people carry the recessive gene. Because the gene is recessive, the disease only occurs if both parents contribute one copy of the sickle gene. In Africa, carrying a single copy has over the centuries provided some protection against malaria, a compensatory benefit. This survival advantage has allowed this otherwise unhealthy trait to remain widely prevalent in Africa, where malaria too is prevalent.

Those unlucky enough to be homozygous, with two copies of the sickle gene, find that their red blood cells become rigid and distorted, acquiring the characteristic sickle shape seen under the microscope. These cells get stuck in the small capillaries, robbing downstream tissue of oxygen, and are destroyed in the process, leading to anemia.

Healthy red blood cells generally survive 90 to 120 days before being replaced, but sickle cells only last 10 to 20 days, and the bone marrow cannot keep pace with this rapid turnover. In 1956, Vernon Ingram (1924 – 2006), later a professor of biology at the Massachusetts Institute of Technology, along with John Hunt and Antony Stretton, first reported that sickle cell anemia was caused by a single amino acid substitution in the hemoglobin molecule. This led to Ingram being credited as the “Father of Molecular Medicine”.

Summary

The city (cell) has a library (nucleus) containing 46 shelves (chromosomes) of books (genes) made of original paper (DNA). Each book (gene) in the library contains many chapters of useful information (exons) and unnecessary information (introns – or at least we currently believe that most introns do not provide useful information). Each chapter (exon) is composed of many words (codons) of three letters (nucleotides). The factory in the cell does not need the original book from the library – just a copy of the useful information (contained in the exons) from the book. At the shelves, the book is taken to pieces and the original codons of 3 letters (A, T, C, and G) are matched up with complementary letters (U, A, G and C) of a growing strand of precursor mRNA. This whole copy of the book, as pre-mRNA, is sent to be rebound (“spliced”) by the librarian (the “spliceosome”), where the non-essential information is discarded and the remaining chapters are “rebound” as the mature mRNA, which then leaves the library and heads to the factory. I have tried to simplify this complex process while describing, with some license, the main steps. Each year, more is understood about the process and the various signaling and regulating chemicals involved, and even more questions arise.

How does the spliceosome form and why? Somewhere, someone is beavering away on that very question.

What controls the rate at which these processes progress, the rebinding for instance?

What controls the rate at which the books are taken off the shelves?

How is the speed at which the newly bound books leave the library regulated?

Intense investigation is trying to reveal the mechanisms that perform these important controls.

We do know now that there are other types of RNA, microRNA and regulatory RNA, which seem to control some of these steps. The introns, previously thought to contain no vital information, may be critical and involved in these processes. As science phrases the important questions and the answers are discovered, correcting misconceptions, new questions will be raised.

Right now, we are already beginning to develop synthetic oligomers which camouflage small pieces of the messenger RNA, in some cases triggering alternative splicing, and thus giving us medicines to treat some of our previously untreatable diseases.