CHAPTER 4.
From Genes to Phenotype
AS WE SAW IN CHAPTER 1, DNA is a very long, thin molecule. This molecule contains the genes that determine what an organism is and does. Since the only thing that varies among DNA molecules is the sequence of base pairs, not the sugar-phosphate backbone, the “program” encoded by DNA must reside in the order of base pairs. And indeed, it is helpful to think of DNA as an old-fashioned ribbon-like computer tape (now replaced by CDs) containing the software of the cell, this software being the base sequence. As in any computer, the software can only be executed if the proper hardware decodes the instructions present in that software. This is in fact how the genes present in DNA carry out their commands: the language of DNA base pairs is interpreted and executed by the cell’s hardware. The decoding of DNA takes place in two separate steps, called transcription and translation, respectively.
Proteins are the end products of the flow of genetic information in living cells. They directly determine how an organism looks and acts. The function of transcription is to produce a ribonucleic acid (RNA) copy of the DNA, while the function of translation is to decode this RNA copy and direct the production of proteins. Proteins can be thought of as agents that determine the phenotype of cells and organisms, while the genotype is stored in DNA.
image
Figure 4.1 The Structures of DNA and RNA Contrasted. DNA is a double-stranded molecule with deoxyribose as the sugar, while RNA is a single-stranded molecule with ribose as the sugar. The thymine in DNA is replaced by a similar structure, uracil, in RNA.
Transcription
In all cells, the genetic message encoded in the DNA is first transcribed into a faithful RNA copy. The two key chemical differences between DNA and RNA are that the sugar in RNA is ribose instead of deoxyribose, and that the base uracil replaces thymine (figure 4.1). In addition, RNA is a single-stranded molecule, not a double helix (a double-stranded molecule) like DNA. However, messenger RNA (mRNA), as RNA copies of DNA are known, has exactly the same base sequence as one of the two strands of the DNA from which it was copied, with U replacing T (figure 4.2).
image
Figure 4.2 DNA and its mRNA Copy. A short piece of double-stranded DNA, represented by letters, and its mRNA copy.
The hardware that copies genes present in the DNA is called RNA polymerase. This enzyme reads the base sequence of DNA and builds an RNA copy of each gene in the DNA molecule. To do this, RNA polymerase binds tightly to DNA, moves along it, and makes an RNA copy using RNA building blocks in the cell. This process is like copying DNA, except that only one of the two DNA strands is copied; the RNA product is therefore released as a single-stranded molecule (figure 4.3). Also, each gene is copied separately by RNA polymerase. For this, the base sequence of each gene has at its beginning what is called a “promoter region.” This is the part of the DNA that RNA polymerase recognizes as the “start” signal for a gene. Think of a promoter as a DNA base-pair sequence that tells RNA polymerase to “begin here.” Transcription of the gene can then proceed. When RNA polymerase reaches the end of the gene, a terminator region, also made of a special base sequence, tells RNA polymerase to “stop transcribing here.” The RNA produced is called messenger RNA (or mRNA). RNA polymerase then falls off the DNA and is ready to perform another round of transcription (figure 4.4).
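The copying step described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the twelve-base strand is hypothetical, strand direction is ignored, and promoters and terminators are left out. It shows that an mRNA built by pairing RNA bases against one DNA strand ends up with the same sequence as the other strand, with U in place of T.

```python
# Toy model of transcription (illustrative names and sequence, not real gene data).
# Base-pairing rules: DNA base -> partner base in DNA, and DNA base -> partner base in RNA.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complement_strand(dna):
    """Return the DNA strand that pairs with `dna`, base by base."""
    return "".join(DNA_PAIR[base] for base in dna)


def transcribe(template):
    """Build mRNA by placing the complementary RNA base opposite each template base."""
    return "".join(RNA_PAIR[base] for base in template)


coding = "ATGAAAGAGTGA"               # hypothetical strand that is NOT copied
template = complement_strand(coding)  # the strand RNA polymerase reads
mrna = transcribe(template)

print(template)
print(mrna)
print(mrna == coding.replace("T", "U"))  # same sequence as the uncopied strand, U for T
```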
Thus, a cell contains a large collection of different messenger RNA molecules, each a copy of a particular gene. Bacteria typically harbor a few thousand genes, while humans have approximately 35,000. This means that bacterial cells can contain a few thousand different messenger RNA molecules, while human cells can contain tens of thousands. There may be just a few messenger RNAs for one gene and hundreds for another. The lengths of messenger RNA molecules also vary, from about 1,000 bases in bacteria to tens of thousands of bases in humans. This is because bacterial genes are in general much shorter than human genes.
image
Figure 4.3 The Beginning of Transcription. A diagram illustrating how transcription starts with the DNA double strand opened up and single-stranded mRNA copied from one of the DNA strands. The lighter-shaded line represents the sugar-phosphate backbones of the DNA and RNA. The rectangular blocks hanging from these backbones represent the bases. The bases of mRNA are complementary to the DNA bases from which they are being copied.
Translation
Messenger RNA molecules are an intermediate step in the decoding of genes; they must next be translated into protein products. The machinery that translates mRNAs is quite complicated, and it took many years to unravel the mystery of its functioning. Proteins are polymers of twenty different amino acids arranged in a particular sequence. This is true of proteins from viruses to human beings. Hundreds of thousands of different proteins exist in the living world. They differ in the order of their amino acids and in their length (table 4.1). The central question then is this: how does the translation machinery “know” where to insert a given amino acid in a growing protein molecule, and how does it “know” where to start and where to stop the synthesis of a particular protein? This is done by the cellular decoding machinery that translates the sequence of nucleotides in mRNA into a sequence of amino acids.
Each amino acid is coded for in DNA by a sequence of three adjacent base pairs. A set of three contiguous base pairs is called a “codon.” For example, ATG determines the amino acid methionine. Most amino acids are coded for by several codons: for example, both AAA and AAG code for lysine. Some amino acids are coded for by as many as six different codons. Interestingly, three codons, TAA, TAG, and TGA do not correspond to any amino acids; they mean “stop” and instruct the translation machinery to stop making a protein molecule. You may already have guessed that stop codons are found at the end of a gene. In addition, there is a single “start” codon, ATG, that, we just saw, codes for methionine. Does this mean that all proteins start with the amino acid methionine? The answer is yes, with very few exceptions. However, many proteins are processed further after they are made, and, sometimes, this process involves removing some amino acids at the beginning of the protein. This is the case for both hemoglobin and insulin listed in table 4.1.
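To make the codon idea concrete, here is a minimal Python sketch that builds the standard genetic code as a lookup table in DNA letters and checks the statements just made. The single-letter amino acid abbreviations and the use of `*` for "stop" are conventions chosen for this sketch; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, written compactly: the 64 codons are listed with each
# position cycling through T, C, A, G, and the string gives one amino acid
# (single-letter abbreviation) per codon. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(CODON_TABLE["ATG"])                      # M, methionine (also the start codon)
print(CODON_TABLE["AAA"], CODON_TABLE["AAG"])  # K K, both code for lysine

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(stop_codons)

# How many codons does each amino acid have?
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
six_fold = sorted(aa for aa, n in codons_per_amino_acid.items() if n == 6)
print(len(codons_per_amino_acid), six_fold)
```

The last line counts twenty amino acids and lists the ones served by six codons each (leucine, arginine, and serine, abbreviated L, R, and S).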
image
Figure 4.4 Transcription. A. The gene in the DNA has a region known as the promoter that signals the RNA polymerase to bind to the DNA and begin making RNA using nucleotide subunits. DNA is diagrammed as a double helix, with the strand that is read printed darker than the opposite strand. The gene also has a terminator region that signals the end of the gene. B. The process of transcription. In Step 1, RNA polymerase binds to the promoter. Step 2 shows the initiation of mRNA production. Step 3 shows a long piece of mRNA, attached to the RNA polymerase, towards the end of the process. Transcription ends in Step 4 when the RNA polymerase finds the terminator and falls off the DNA.
Table 4.1 Examples of Proteins and Their Amino Acid Compositions
image
Compositions are for:
Human β-hemoglobin, the protein responsible for carrying oxygen in our blood. This is the protein that is mutated in sickle-cell anemia.
DNA polymerase from Thermus aquaticus, used in PCR reactions.
Human insulin, the protein necessary for regulating blood sugar.
Human fibrillin, a connective tissue protein, which if mutated can cause Marfan’s syndrome.
Rubisco (ribulose bisphosphate carboxylase), which is the most abundant protein in the world and necessary for all of life. It is present in all plants and is used to convert the energy from the sun to fix carbon; that is, to make carbon available to nonplants. This is the composition of rubisco from sugar maple.
image
Figure 4.5 Converting a Gene to a Phenotype. A. A short stretch of double-stranded DNA. B. The corresponding mRNA. C. The corresponding protein made using this gene.
Figure 4.5 shows a hypothetical very short gene (the DNA double helix), its mRNA copy (where U replaces T), and the protein product of that gene. The gene has an ATG codon at the beginning, and one of the stop codons (TGA) at the end. Note that all the codons (except TGA) direct that a particular amino acid be placed in a particular position in the protein. The codons for all twenty amino acids and the “start” and “stop” codons are known (figure 4.6); together they make up the genetic code. The genetic code is universal (with very minor exceptions), a fact that led Jacques Monod to claim that “what’s true for E. coli [a bacterium] is true for an elephant.”
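The path from gene to protein can be followed step by step in a short Python sketch. The gene below is a made-up example, not the one drawn in figure 4.5, and the function names are illustrative. Amino acids are written with single-letter abbreviations.

```python
from itertools import product

# Standard genetic code in RNA letters; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(gene):
    """The mRNA carries the same base sequence as the gene, with U replacing T."""
    return gene.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCTAAAGAGTGA"  # hypothetical: start codon, three more codons, stop codon
mrna = transcribe(gene)
protein = translate(mrna)
print(mrna)     # AUGGCUAAAGAGUGA
print(protein)  # MAKE (methionine, alanine, lysine, glutamic acid)
```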
At this point, let us examine in greater detail the process of translation. To understand this mechanism, it is useful to think in terms of two different languages. In that sense, both DNA and RNA “speak” a language where the “words” are codons made up of three bases. RNA “speaks” almost the same language as DNA; let us call it a “dialect” where U replaces T. The DNA language and the RNA dialect are in fact indistinguishable because U and T have the same “meaning.” However, proteins do not “speak” the base language or dialect: they speak the language of amino acids. How is one language (and its dialect) translated into the other? In human affairs, an interpreter does the job. Does it work the same way in living cells? Yes, it does.
image
Figure 4.6 The Genetic Code. Note that amino acids can be coded for by several codons. Special codons for start and stop are shown in dark boxes.
In the early 1960s, researchers discovered a class of small RNA molecules, called transfer RNAs (tRNAs), that have the ability to bind specific amino acids. What’s more, it was realized that these tRNAs also had the ability to “recognize” mRNA codons via three complementary bases that they possess in what is called an anticodon. Therefore, these tRNAs can communicate with both the “base language” (thanks to their anticodons) and the “protein language” (thanks to their ability to bind amino acids). What happens (figure 4.7) is that tRNAs loaded with their amino acids line up along an mRNA molecule, interact through their anticodons with specific mRNA codons, and, by doing so, bring the amino acids they carry into very close contact. These amino acids can then form chemical bonds between one another and so produce a protein.
The “assembly line” containing the mRNA, tRNAs, and a growing protein needs to be held together in a precise fashion to function properly. This is achieved in cells through interactions with particles called ribosomes. Ribosomes are composed of many proteins and several RNA molecules. These are arranged in such a way that ribosomes can slide along an mRNA molecule, while cavities in the ribosome structure hold tRNA molecules loaded with an amino acid in the right orientation. This ensures that codons, anticodons, and amino acids are aligned properly in close proximity and in the right order. The full protein “assembly line” is depicted in figure 4.8. Note that protein synthesis starts at the AUG RNA codon (with methionine). When the ribosome reaches a UGA codon, the ribosome cavity facing that codon remains empty because no tRNA with an ACU anticodon exists, and so protein synthesis stops. The now fully formed protein then leaves the ribosome and becomes free to exercise its function where it is needed.
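The roles of tRNAs and the ribosome can be imitated with a small Python sketch. It is a deliberate simplification: it assumes one tRNA for every codon that specifies an amino acid, writes each anticodon left to right as the base-by-base complement of its codon (so that UGA would need ACU, as in the text), and ignores everything else the ribosome does. All names and the example mRNA are hypothetical.

```python
from itertools import product

# Standard genetic code in RNA letters; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon):
    """Three bases complementary to the codon, written in the same left-to-right order."""
    return "".join(RNA_PAIR[base] for base in codon)


# Simplifying assumption: one tRNA per amino-acid codon, keyed by its anticodon.
# Stop codons get no tRNA at all.
TRNA_POOL = {
    anticodon_for(codon): amino_acid
    for codon, amino_acid in GENETIC_CODE.items()
    if amino_acid != "*"
}


def ribosome(mrna):
    """Start at the first AUG, then add one amino acid per codon until no tRNA fits."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = TRNA_POOL.get(anticodon_for(mrna[i:i + 3]))
        if amino_acid is None:  # cavity stays empty: synthesis stops
            break
        protein.append(amino_acid)
    return "".join(protein)


print(anticodon_for("AUG"), TRNA_POOL["UAC"])  # UAC M
print("ACU" in TRNA_POOL)                      # False: nothing pairs with UGA
print(ribosome("GGAUGAAAGAGUGAAAA"))           # MKE
```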
image
Figure 4.7 Translation. Messenger RNA encodes a gene depicted at the top in triplet codons. Transfer RNAs (tRNAs), which translate the triplet codes into a sequence of amino acids, are depicted by the curlicue structure of their sugar-phosphate backbones. At one end of the tRNA are three letters corresponding to the triplet anticodons that are complementary to the codons in mRNA. The corresponding amino acids are attached to the opposite ends of the tRNAs. The process of making proteins involves breaking the bonds between amino acids and tRNA and making new bonds to hold the amino acids together in the new protein.
Transcription and translation basically work the same way in all living cells, from bacteria to humans.
Changes in DNA Modify the Amino Acid Sequences of Proteins
Given that codons in the DNA completely determine the nature and location of amino acids in proteins, it is easy to see that if, for any reason, codons in a gene change, this will result in the insertion of a different amino acid in a protein. This is basically how mutations (changes in DNA base pairs) manifest their effects. Mutations will be studied in chapters 7 and 8. For now, let us just look at a few consequences of some base pair changes in the gene that codes for β-hemoglobin, one of the protein chains that constitute hemoglobin. We saw in chapter 3 that sickle-cell anemia is a very serious genetic disease. Its molecular basis is very well understood. Let us look at a small portion of the normal β-hemoglobin mRNA, which corresponds to codons 5–8:
… CCA-GAA-GAG-AAG.…
image
Figure 4.8 A Protein Being Made from mRNA. A cartoon image of protein being made off of mRNA. The wavy line is the mRNA with a ribosome in the process of translating the sequence of triplet codons into their corresponding amino acids. Once the amino acids on the tRNA are attached to the growing protein chain, tRNAs without amino acids (on the left) are released. Then the ribosome moves over to the next codon and another tRNA with the complementary anticodon with its amino acid comes into the ribosome.
The genetic code indicates that the translation product of this mRNA is the amino acid sequence
… Proline–Glutamic acid–Glutamic acid–Lysine.…
It turns out that this gene is mutated in sickle-cell anemia patients and produces an mRNA with the sequence
… CCA-GUA-GAG-AAG.…
You will notice that the second base of the second codon (GUA) is now a U instead of an A. This change encodes a protein with the amino acid sequence
… Proline–Valine–Glutamic acid–Lysine.…
Thus, the amino acid valine replaces the glutamic acid found in normal hemoglobin, because GUA codes for valine while GAA codes for glutamic acid. This single amino-acid substitution occurred because the β-hemoglobin gene underwent a mutation, whereby an A-T base pair in the DNA was turned into a T-A base pair. Moreover, this single change of an amino acid is responsible for all the problems associated with sickle-cell disease. This is because mutant β-hemoglobin, with this seemingly minor change from a glutamic acid to a valine, has a strong tendency to crystallize in red blood cells and sickle them, unlike normal β-hemoglobin. Many genetic diseases are the result of such subtle mutational changes in codons.
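The comparison between the normal and the mutant message can be reproduced with a short Python sketch that translates the two mRNA fragments shown above and reports where they differ. The function names are illustrative; the codon numbering starts at 5 because the fragment covers codons 5–8.

```python
from itertools import product

# Standard genetic code in RNA letters; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Full names for the four amino acids that occur in this fragment.
NAMES = {"P": "Proline", "E": "Glutamic acid", "V": "Valine", "K": "Lysine"}


def find_changes(normal, mutant, first_codon_number):
    """Compare two dash-separated mRNA fragments codon by codon."""
    changes = []
    pairs = zip(normal.split("-"), mutant.split("-"))
    for offset, (normal_codon, mutant_codon) in enumerate(pairs):
        if normal_codon != mutant_codon:
            changes.append((
                first_codon_number + offset,
                normal_codon,
                mutant_codon,
                GENETIC_CODE[normal_codon],
                GENETIC_CODE[mutant_codon],
            ))
    return changes


changes = find_changes("CCA-GAA-GAG-AAG", "CCA-GUA-GAG-AAG", first_codon_number=5)
for number, normal_codon, mutant_codon, normal_aa, mutant_aa in changes:
    print(f"codon {number}: {normal_codon} -> {mutant_codon}, "
          f"{NAMES[normal_aa]} -> {NAMES[mutant_aa]}")
# codon 6: GAA -> GUA, Glutamic acid -> Valine
```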
Gene Regulation
The genome is the sum total of all the genes present in an organism. For the organism to function properly, not all genes in the genome can be active at all times. Similarly, not all cells of a multicellular organism express all the genes present in its genome. For example, human pancreatic cells do not make hemoglobin, but they do produce insulin. On the other hand, progenitors of red blood cells synthesize hemoglobin but do not make insulin. Also, human embryos produce a special hemoglobin that is not found in the adult. How is this possible, knowing that all cells in the human body contain the same DNA? The answer is that cells can turn genes on and off to regulate which genes are expressed in what cells and when. We say that the insulin gene is turned off in all cells except specialized pancreatic cells, whereas the gene coding for embryonic hemoglobin is turned on during fetal life but is turned off later on.
Gene regulation is very complicated and only partially understood. It takes place in all living organisms, from bacteria to humans. Often, whether a gene is active (transcribed) or inactive (not transcribed) depends on interactions between the gene’s promoter and protein factors present in a given type of cell. Basically, when these protein factors are present, they bind to the promoter region and physically block it (figure 4.9). Under these circumstances, RNA polymerase has no access to the promoter and hence cannot transcribe the gene. Without mRNA production, the gene cannot produce its protein, and thus the gene is turned off. Conversely, other factors can bind to this blocking protein factor and prevent it from blocking the promoter. In this case, RNA polymerase has access to the promoter. The gene is then transcribed, its mRNA subsequently translated, and the gene is turned on. Cells contain thousands of such factors that act in concert to fine-tune gene activity.
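The on/off logic just described can be written as a tiny Python sketch. It models only the two kinds of factors mentioned in this paragraph, and the factor names are hypothetical placeholders, not real proteins.

```python
def gene_is_on(factors_in_cell, blocking_factor, releasing_factor):
    """A gene is off only when its blocker is present and nothing keeps the blocker away.

    factors_in_cell  -- set of protein factors present in this cell
    blocking_factor  -- factor that binds the promoter and blocks RNA polymerase
    releasing_factor -- factor that binds the blocker and keeps it off the promoter
    """
    promoter_blocked = (
        blocking_factor in factors_in_cell
        and releasing_factor not in factors_in_cell
    )
    return not promoter_blocked


# Three hypothetical cells that all carry the same gene.
cells = {
    "no blocker": set(),
    "blocker only": {"blocker_B"},
    "blocker and releaser": {"blocker_B", "releaser_R"},
}
for name, factors in cells.items():
    state = "on" if gene_is_on(factors, "blocker_B", "releaser_R") else "off"
    print(f"{name}: gene is {state}")
```

The three cells contain identical DNA, yet only the cell with the blocker alone keeps the gene off, which is the sense in which different cell types can express different genes.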
image
Figure 4.9 Regulation of Transcription. A. Definitions of symbols used in this figure. B. A portion of DNA diagrammed to show that when RNA polymerase binds the promoter region, transcription can proceed. C. If a protein factor binds to the promoter region of the DNA, RNA polymerase cannot bind and thus the protein factor prevents transcription of the gene.
Embryonic Stem Cells Exhibit a Special Type of Gene Regulation
Embryonic stem cells provide an interesting example of gene regulation. Embryos contain stem cells, cells that can be coaxed into producing any and all types of human tissue. By definition, a young embryo contains cells that will eventually produce all the organs found in a complete human being. In other words, the cells of a young embryo are pluripotent, and can differentiate into any cell type, such as brain, heart, or muscle cells, and so forth. Researchers are keen on understanding what triggers embryonic stem cells to differentiate into specialized tissues. Once these mechanisms are understood, it will be possible to regenerate organs that are faltering due to accident or illness. The idea of repairing the spinal cord of paraplegic and quadriplegic individuals comes to mind. For now, suffice it to say that not much is known about the signals that allow embryonic stem cells to differentiate into various organs. One thing is clear, however. Before differentiation, embryonic stem cells express a large collection of genes; many of their genes are “on.” As the cells differentiate, many genes are progressively turned “off,” and specialized genes, which are only expressed in differentiated organs, are turned “on.”
Box 4.1 Why People Are Saving Their Babies’ Cord Blood
Because stem cells are capable of turning into a variety of different specialized cells, such as brain cells or blood cells, they hold great promise for the treatment of diseases such as Alzheimer’s, heart disease, and a variety of cancers. Cells in an early embryo, just by their very nature, can give rise to all the cells in the body. However, embryonic stem cells are controversial because their use necessitates killing the embryo.
Another source of stem cells is umbilical cord blood. These stem cells can give rise to cells in the blood, including red blood cells, which carry oxygen, white blood cells, which are cells of the immune system, and platelets, which are involved in blood clotting. Similar stem cells are found in adult bone marrow. But the stem cells in cord blood are younger and thus have a higher chance of providing a good match. Cord blood also has a higher number of stem cells. Until recently, the blood in the umbilical cord was routinely discarded with the placenta. We can now collect and store the stem cells from cord blood. The procedure requires that cord blood be collected soon after the umbilical cord is clamped off. Special processing and freezer storage make the stem cells available years later to the individual from whom the cord blood was collected. Although not all diseases treatable by stem cells can be treated with cord-blood stem cells, a large number of diseases of the blood and immune system, as well as some metabolic diseases, can be.
In the United States, cord blood may be donated to many public blood banks, including the American Red Cross, for use by anyone. The National Institutes of Health also maintain a cord-blood bank. Donating your baby’s blood to these two cord-blood banks costs the donor nothing. Alternatively, one can pay approximately $2,000 to store the baby’s cord blood in a private cord bank, Cord Blood Registry (http://www.cordblood.com), and have it available for the donor’s use.
One account of how cord blood can save a life was reported on CBS’s 60 Minutes II. This is the story of Keone Penn, a teenager with sickle-cell anemia, a disease of the hemoglobin in red blood cells. In the past, the only possibility for a cure was a bone marrow transplant. However, for Keone, there was no good donor match. But doctors thought that cord blood might help to cure Keone of his sickle-cell disease. It would be easier to find a good match of cord blood, and cord blood has higher numbers of stem cells than even bone marrow. Fortunately for Keone, a good match was found in the New York Public Blood Bank. His own bone marrow stem cells with the sickle-cell gene were killed with chemotherapy, and he then received the matching cord blood, which proceeded to make all the different blood-cell types, including normal red blood cells that do not sickle.
Leukemia is a cancer of the blood cells. Cord blood can be used to replace the cancerous cells later in the life of the donor, but in an unusual twist, cord blood can even save the life of the donor’s mother! Patrizia Durante was pregnant with her first daughter when she was diagnosed with leukemia. Her daughter was delivered prematurely, and Patrizia was aggressively treated with chemotherapy to rid her of her leukemia. Again, the traditional treatment is to look for matching bone marrow, but again her doctors could not find a match. So they used her baby’s cord blood instead. Not only did the baby’s stem cells start rebuilding her mother’s blood, they also attacked the cancerous cells.
With these and many other successes, there is increasing interest in cord-blood banking. We can now make good use of blood that until recently was thrown into the garbage.
Summary
The genetic information, stored in DNA base pairs, is first converted into messenger RNA molecules (mRNA) that are copies of the DNA genes. This process is called transcription. Regulation of transcription is performed by base sequences that represent transcription start and stop sites, called promoters and terminators. The messenger RNA molecules are then translated into the amino acid sequence of proteins by transfer RNAs (tRNAs) and ribosomes. The genetic code consists of codons, three-base-pair sequences that specify amino acids, as well as translation start and stop codons necessary to form proteins. As a result, changes in codons at the DNA level, mutations, can result in the incorporation of the incorrect amino acids into proteins. Proteins with altered amino acid sequences can malfunction and lead to a diseased phenotype. Finally, we also saw that gene activity can be turned on and off at the level of transcription by protein factors that can prevent or allow transcription.
Try This at Home: DNA Replication, Transcription, and Translation Game
With this game, we can graphically see how genes are copied, transcribed, and translated into protein. The purpose of this game is to faithfully replicate a piece of DNA and translate it into the correct protein sequence. It is written as a game involving a number of teams, but if you want to do this on your own, just don’t check your sequence with the earlier sequence.
Procedure
The first member of each team will be given a DNA sequence, which may not be shown to the other members. The team members verbally "replicate" the DNA five times. For example, if person #1 has the sequence ATG, #1 figures out the complementary sequence, TAC, and says to #2, "TAC." Person #2 writes TAC, figures out the complementary sequence, ATG, and says "ATG" to #3, and so forth until person #5. Person #5 writes down the sequence he or she gets from #4 then determines the mRNA complementary to it. In this case, it would be UAC, so person #5 says "UAC" to #6. Then #6 determines the amino acid sequence corresponding to the mRNA using the single letter amino acid abbreviation given in figure 4.6. In this case, it would be the amino acid tyrosine, or Y. Team members may help in translation but may not provide the DNA sequence they wrote down.
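For readers playing alone, one error-free round of the game can be simulated in Python. The sketch follows the worked example above: four spoken hand-offs (from person #1 through to person #5), one transcription step by person #5, and one translation step by person #6. The function names are illustrative.

```python
from itertools import product

DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Standard genetic code in RNA letters; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def replicate(dna):
    """What a player says aloud: the complementary DNA sequence."""
    return "".join(DNA_PAIR[base] for base in dna)


def transcribe(dna):
    """Person #5: the mRNA complementary to the DNA sequence written down."""
    return "".join(RNA_PAIR[base] for base in dna)


def translate(mrna):
    """Person #6: single-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def play_round(first_sequence, spoken_handoffs=4):
    """Pass the sequence down the line, then transcribe and translate it."""
    written = first_sequence
    for _ in range(spoken_handoffs):
        written = replicate(written)  # the next person writes down what was said
    mrna = transcribe(written)
    return mrna, translate(mrna)


print(play_round("ATG"))  # ('UAC', 'Y')
```

Calling `play_round` on a longer sequence gives the answer that a team making no mistakes should reach.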
The first team to finish wins if its replication, transcription, and translation are 100 percent correct. If there are errors, the team loses 10 seconds for each error in replication that does not result in an error in the final protein sequence and 20 seconds for each error in the protein sequence. The resulting adjusted time then determines the winner. Mistakes are often made in such a game. The longer the DNA sequence, and the more times it is copied, the higher the chance for errors in the sequence, that is, mutations. This game imitates real life well because larger genes have a higher chance of mutating, and the more often genes are copied, the higher the chance of mutations.
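The scoring rule can be worked through in a few lines of Python. The sketch assumes that "losing" seconds means adding them to a team's finishing time, so that the lowest adjusted time wins; the team names, times, and error counts are invented for illustration.

```python
def adjusted_time(finish_seconds, silent_replication_errors, protein_errors):
    """Finishing time plus penalties (assumed to be added to the time).

    10 seconds per replication error that leaves the protein unchanged,
    20 seconds per error in the final protein sequence.
    """
    return finish_seconds + 10 * silent_replication_errors + 20 * protein_errors


# Hypothetical results: (finishing time in seconds, silent errors, protein errors)
results = {
    "Team A": (95, 0, 1),   # fastest, but one wrong amino acid
    "Team B": (110, 0, 0),  # slower, no mistakes
    "Team C": (100, 2, 0),  # two copying slips that did not change the protein
}

scores = {team: adjusted_time(*result) for team, result in results.items()}
winner = min(scores, key=scores.get)
print(scores)  # {'Team A': 115, 'Team B': 110, 'Team C': 120}
print(winner)  # Team B
```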
Here are a couple of DNA sequences whose translated amino acid sequences, written in the single-letter abbreviations of the amino acids, spell out something interesting.
 
CCTCTTTTGCTCTGATAAACATCGTATTCGTTGCTTCGCTGTATT
ACCCTTGTGCGGCAACTCCCTGTTGTCCTACCCCTCTTACTCTCG