The most famous three-letter acronym in science is short for ‘deoxyribonucleic acid’. DNA is the most important chemical in genetics, because it is within the intricate structure of its long molecules that the code for all our genes is held. Every person has 2 metres (6 ft 6 in) of DNA in each of their cells – added together, the total DNA in a human body would stretch to the Sun and back 66 times. The vast majority of a cell’s DNA is housed in the nucleus, with a tiny amount in the mitochondria.
DNA was first isolated in 1869 when Swiss physician Friedrich Miescher (opposite) analysed pus collected from bandages covering infected wounds. The substance he extracted was found to contain the sugar deoxyribose, phosphates and nitrogen-containing organic acids. It was soon firmly associated with the contents of the cell nucleus, and its acids were named nucleic acids accordingly. In 1928, DNA was found to reside in chromosomes, confirming that it was indeed the long-sought genetic material that carried genes between generations. The question was: how did it do it?
Components from Crick and Watson’s groundbreaking 1953 model of the DNA double helix.
To understand how DNA carries genes, it was first necessary to figure out its structure. DNA is a polymer, meaning it is made up of many smaller units chained together. Those units were recognized, but it was not understood how they linked. Individual DNA molecules are too small to see directly, so it was not until the 1950s that the pioneering technique of X-ray crystallography was able to solve the puzzle.
X-ray crystallography made use of the phenomenon of diffraction, a process common to waves of all types. When a wave passes through a gap that is narrower than its wavelength, it propagates from the gap in all directions as if it was starting again from a point source. X-rays have a small enough wavelength to diffract as they shine through the spaces in a molecule. The diffracted waves that emerge on the other side interfere with each other to create a pattern of light and dark bands that can be interpreted to pinpoint the relative positions of gaps in the molecule, and reveal its shape.
The regular arrangement of atoms in a crystal produces an ordered diffraction pattern. Careful analysis of the patterns can reveal the structure of organic molecules, such as DNA.
The American chemist Linus Pauling, a future Nobel Prize winner, had used X-ray crystallography in 1951 to show that proteins contain a spiral structure he named the alpha helix. His own proposal for DNA, a helix of three strands, proved to be not quite right, however, and in 1953 two researchers working in Cambridge – James Watson and Francis Crick (see here) – showed that the molecule was actually a double helix, composed of two strands of DNA that linked together.
The DNA double helix can be understood as a twisted ladder. Its two ‘uprights’ are made from chains of deoxyribose sugars bonded together by phosphates, while the ‘rungs’ are pairs of nucleic acids, now known simply as bases, that are strung between the chains of sugar. This unique structure allows DNA to be separated into two distinct strands by splitting the base pairs in two. There are four base units used in DNA, and it is their order through the double helix that carries the coded information of every gene.
Francis Crick and James Watson won the Nobel Prize for Medicine in 1962 for their discovery, a decade earlier, of the double helix structure of DNA, along with their work in showing how it could be used to store genetic information.
Englishman Crick (opposite, right) had worked as a physicist and weapons designer during the Second World War, before turning to biology, where he developed mathematical theories for use in interpreting X-ray diffraction. Watson (left), an American molecular biologist, joined Crick at Cambridge’s Cavendish Laboratory, where they collaborated to deduce the now famous structure of DNA. They did not perform their own X-ray diffraction experiments, but were instead supplied with data by Maurice Wilkins, the supervisor of a lab at King’s College, London. Wilkins shared the 1962 prize with Crick and Watson, even though he had not collected the images himself – the crucial evidence was gathered by his late colleague Rosalind Franklin. She was never consulted, and controversy over her role has raged ever since.
One particular X-ray crystallography photograph was pivotal to allowing Crick and Watson to solve the mystery of DNA’s molecular structure. The pattern of light and dark regions in the famous ‘Photo 51’ provided the evidence they needed to confirm their idea that a double helix was indeed the true form of the DNA found in chromosomes.
Photo 51 is often attributed to Rosalind Franklin, but it was actually produced by Raymond Gosling, a PhD student working under her supervision at King’s College, London. Franklin had filed the photo for later study, and it was then given to Maurice Wilkins by Gosling several months later. Wilkins showed it to James Watson, who immediately saw its potential – in the late 1940s, chemist Linus Pauling and his team at the California Institute of Technology had used the X-shaped diffraction patterns produced by many proteins to prove that they contained helical shapes, and the strong X-pattern in Photo 51 suggested that DNA too was a helix.
By the time the Nobel committee chose to award Crick, Watson, and Franklin’s erstwhile colleague Wilkins a prize in 1962, Rosalind Franklin was already dead. She had passed away in 1958 at the age of just 37, killed by ovarian cancer that may have been brought on by her repeated exposures to dangerous levels of X-rays. The Nobel rules dictate that a prize is never awarded posthumously, so we will never know whether, if Franklin had lived, she would have joined the three men as an equal winner.
In later years, Franklin’s contributions to DNA research have been more widely acknowledged. Her major breakthrough was not Photo 51 per se. Instead, she had discovered that the DNA found in a cell’s nucleus is in a highly hydrated form, which she called B-DNA. (A-DNA is a less hydrated form that is uncommon in biology.) Crick and Watson’s eventual success came from modelling the structure of this B form of DNA, a right-handed double helix.
A DNA double helix contains millions of ‘base pairs’, each consisting of two nucleic acids bonded together that run across the axis of the helix. Crick and Watson’s discovery of the double helix showed that the base pairings formed according to a set of simple rules.
There are four bases in DNA: adenine, cytosine, guanine and thymine. Each of them consists of ringed organic compounds containing nitrogen atoms. Adenine and guanine share a similar structure, having two rings, while cytosine and thymine have just one apiece.
For a base pair to fit across a ‘rung’ of the helix molecule, however, there is only room for three rings, so a two-ring base must always sit opposite a one-ring base. Adenine always pairs with thymine, while guanine pairs with cytosine. There is no restriction on the direction of the pairing. Any of the four bases can appear on one side of the helix, but its partner will always be on the other side.
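The pairing rules can be written down as a small lookup table. The sketch below is only an illustration: the names `RINGS`, `PARTNER` and `is_base_pair` are invented here, and the ring counts and pairings are taken directly from the description above. Note that the ring count narrows the options but does not by itself pick the partner, so the table simply encodes the stated rule.

```python
# Number of rings in each base, as described in the text.
RINGS = {"A": 2, "G": 2, "C": 1, "T": 1}

# Pairing rules from the text: A with T, G with C, in either direction.
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}


def is_base_pair(left: str, right: str) -> bool:
    """Return True if the two bases form one of the allowed pairs."""
    return PARTNER.get(left) == right


# Every allowed pair joins a two-ring base to a one-ring base: three rings.
for base, partner in PARTNER.items():
    assert RINGS[base] + RINGS[partner] == 3

print(is_base_pair("A", "T"))  # True
print(is_base_pair("G", "C"))  # True
print(is_base_pair("A", "G"))  # False: two two-ring bases make four rings
```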
The four DNA bases – adenine, cytosine, guanine and thymine – create a four-character code, generally abbreviated to the initial of each base: ACGT. The Human Genome Project has found that the DNA in a human cell contains a code of more than 3 billion of these characters – printed out, they would fill 130 volumes of solid text. Reading one letter a second, it would take 95 years to get through the whole thing – and much of its meaning is still a mystery.
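The reading-time figure is easy to check. The sketch below assumes an average year of 365.25 days and a round 3 billion letters; the function name `years_to_read` is made up for this illustration.

```python
GENOME_LETTERS = 3_000_000_000  # round figure for the human code, per the text
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def years_to_read(letters: int, letters_per_second: float = 1.0) -> float:
    """Years needed to read `letters` characters without ever stopping."""
    return letters / letters_per_second / SECONDS_PER_YEAR


print(round(years_to_read(GENOME_LETTERS)))  # 95
```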
Because each DNA molecule is a double helix, it is built from two strands of DNA entwined together. Each strand carries a code using the ACGT alphabet, running to millions of characters. What’s more, the code on one strand is always mirrored on the opposite strand that carries the partner bases. However, only one of these strands carries the ‘live’ genetic code used by cellular processes – this is known as the ‘sense’ strand, while the opposite one is the ‘antisense’ strand.
During every cell division, all the DNA in the nucleus must be copied or replicated. The replication process makes use of the fact that one side of the DNA double helix can act as a template for its partner. Replication can take place through purely chemical means in a process called autocatalysis, but inside the cell it is heavily managed by enzymes to ensure that a high-fidelity copy is produced. The double helix is unzipped into two strands. Individual bases are paired up with their partners on each exposed strand, and a new deoxyribose backbone is constructed to keep them in place. The result is two identical double helices. ‘Proof-reading’ enzymes check the pairings to ensure that there are no errors.
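The template idea can be sketched in a few lines. This is a deliberately simplified model, not a description of the real enzymes: a helix is represented as a pair of strings, strand direction is ignored, and the names `partner_strand` and `replicate` and the short example sequence are invented for the illustration.

```python
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand: str) -> str:
    """Build the opposite strand by pairing every base with its partner."""
    return "".join(PARTNER[base] for base in strand)


def replicate(helix):
    """Unzip a (strand, partner) pair and rebuild a partner for each half."""
    first, second = helix
    daughter_one = (first, partner_strand(first))    # old first strand as template
    daughter_two = (partner_strand(second), second)  # old second strand as template
    return daughter_one, daughter_two


# Hypothetical eight-base helix, used only as an example.
original = ("ATGCCGTA", partner_strand("ATGCCGTA"))
copy_one, copy_two = replicate(original)
print(copy_one == original and copy_two == original)  # True
```

Each daughter helix keeps one strand from the original and gains one newly built strand, which is why a single mismatched base on a template would be passed on unless it is caught by proof-reading.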
At the level of the chromosome, this results in two structures, known as chromatids, that carry identical copies of DNA. The chromatids are connected at a point called the centromere. When the chromatids separate during cell division, they become individual chromosomes in their own right.
DNA replication is managed by a suite of enzymes that unzip the DNA helix and use each side as a template for two new identical DNA molecules.
During the replication process, all the DNA in a cell is involved. However, when DNA code needs to be read by the cell, only certain portions of the strands are used – the rest is simply ignored. The sections of a double helix that contain a meaningful code are known as exons, while the unused portions are called introns.
Introns of varying lengths are muddled among the exons along the DNA. The introns are often termed ‘junk DNA’, suggesting that they have no use – they are simply there because the cell faithfully copies them along with all the other bits of DNA. However, the preferred scientific name for them is non-coding DNA, since their actual utility is still much debated – perhaps they have some as yet unknown purpose? According to the latest estimates, more than 90 per cent of human DNA takes the form of non-coding introns. This figure is much higher than in simpler organisms such as bacteria, where barely 10 per cent is intronic, so it may be that introns arise naturally as DNA grows more complex.
While much, if not most, DNA consists of non-coding introns, the process of transcription ‘edits out’ excess information.
The discovery of the double helix was merely the first step in solving the mystery of genetic inheritance. Over the ensuing decades, a picture was gradually pieced together to show how the information coded in DNA is deciphered and enacted by the cell. This awe-inspiring mechanism has become known as the ‘central dogma’ of molecular biology.
The central dogma equates a particular portion of DNA, known as a cistron, with a particular protein used by the living cell. Each cistron contains the information to build one single type of protein. This means that ‘cistron’ and ‘gene’ are two words for the same thing – a single unit of inherited information.
Secondly, the dogma shows that the information on the cistron only travels in one direction: the DNA code can be used to create a protein, but the structure of the protein cannot be translated back into a DNA code.
While DNA is the repository of the information that forms a cell’s genes, the cell actually uses a related chemical called RNA to read it. RNA is short for ‘ribonucleic acid’. Its molecules are broadly similar to DNA in their fine structure, save for the different way in which the sugar backbone is constructed from ribose sugar molecules linked by phosphate groups. The presence of a hydroxyl (OH) group attached to the ribose ring affects the structure and makes it impossible for RNA strands to form into long double helices. Instead, they are more often found as clusters of short helices or as single strands. RNA uses three of the same bases as DNA: adenine (A), cytosine (C) and guanine (G). But in place of thymine (T), RNA uses a base called uracil (U). Therefore, the genetic alphabet of DNA – ACGT – translates into ACGU when RNA is involved. RNA is the workhorse of the central dogma process that copies, transports and reads genetic information within every cell.
While genes are stored in the nucleus, the proteins they code for are assembled outside, in the cell’s cytoplasm – and specifically at tiny organelles called ribosomes. As a result, genes need to be copied and transported from the nuclear DNA to these external sites. The copying process is called transcription, and unlike the complete replication of the DNA helix, it results in a single copied strand – of RNA, rather than DNA. The aim of transcription is to make a copy of the ‘sense’ strand – the side of the DNA helix that carries the protein-building instructions. Therefore, the helix is unzipped and the opposite ‘antisense’ strand is used as a template for the resulting piece of RNA.
This copy contains all the introns in the gene as well as the exons, and the introns are snipped out to create an edited copy. This finished strand, manufactured in the nucleus, is called messenger RNA (mRNA). It is then hauled through a pore in the nuclear membrane and begins its journey to the ribosome.
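Transcription and editing can be modelled as two string operations. The sketch below is a simplified illustration under stated assumptions: strand direction is ignored, the example sequence and the intron position are invented, and the function names `transcribe` and `splice` are hypothetical.

```python
from typing import List, Tuple

# RNA partner for each base on the DNA template (uracil replaces thymine).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(antisense: str) -> str:
    """Pair each base of the antisense template with its RNA partner."""
    return "".join(DNA_TO_RNA[base] for base in antisense)


def splice(rna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove each intron, given as a (start, end) slice with end excluded."""
    pieces = []
    position = 0
    for start, end in sorted(introns):
        pieces.append(rna[position:start])  # keep the exon before this intron
        position = end                      # skip over the intron itself
    pieces.append(rna[position:])           # keep the final exon
    return "".join(pieces)


# Hypothetical template whose sense strand reads ATGGCATTTTTTCCGTGA.
raw_copy = transcribe("TACCGTAAAAAAGGCACT")
print(raw_copy)                     # AUGGCAUUUUUUCCGUGA
print(splice(raw_copy, [(6, 12)]))  # AUGGCACCGUGA
```

The raw copy reads exactly like the sense strand with U in place of T, which is the point of using the antisense strand as the template.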
The minute bodies known as ribosomes are often lumped in with other cell machinery under the catch-all term of organelles, but they are in fact distinctly different, since they are found in both eukaryotic and prokaryotic cells (and the latter by definition do not have any of the large internal structures classed as organelles). The widespread distribution of ribosomes is a pointer to their significance in cell biology, and their primordial nature – they are generally presumed to have evolved before the major evolutionary separation of eukaryotic and prokaryotic organisms.
A ribosome is a site where the information coded on a strand of mRNA (messenger RNA) is read and used to assemble a protein. The ribosome itself is largely made from another form of RNA, known appropriately as ribosomal RNA or rRNA. The rRNA forms two subunits, one large and one small. Every strand of mRNA is threaded between these two units during a reading process known as translation.
Protein synthesis involves two key phases – the copying of DNA information onto messenger RNA in the cell nucleus (transcription), and the formation of a protein using that information, known as translation. Ribosomes expose mRNA to a reading system composed of yet more RNA structures – this time made of transfer RNA. This tRNA is a loop of RNA with three bases exposed at one end. These bases correspond to three partner bases on the mRNA held in the ribosome. Each of the mRNA’s three-letter codes is called a codon, while the corresponding three letters on a tRNA are its anticodon.
The mRNA passes through the ribosome one codon at a time, allowing a tRNA to match up with each one. Attached at the other end of the tRNA is an amino acid, the building block of a protein molecule. Each codon and anticodon pairing corresponds to a specific amino acid, in a specific position among the hundreds that are required to build a particular protein. The protein is assembled by reading all the gene’s codons in the right order.
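The matching of codons to anticodons can be sketched as follows. This is an illustration only: strand direction is ignored, the short mRNA string is invented, and `codons` and `anticodon` are hypothetical helper names.

```python
# Base-pairing partners within RNA (uracil takes the place of thymine).
RNA_PARTNER = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons(mrna: str) -> list:
    """Split an mRNA string into consecutive three-letter codons."""
    usable = len(mrna) - len(mrna) % 3  # ignore any incomplete codon at the end
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def anticodon(codon: str) -> str:
    """Return the three partner bases a tRNA would need to match the codon."""
    return "".join(RNA_PARTNER[base] for base in codon)


for codon in codons("AUGGCACCG"):
    print(codon, "is matched by anticodon", anticodon(codon))
```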
The codon is the interface between the genetic world and the metabolic one. A codon is a simple three-letter string of bases that codes for a particular amino acid, and a gene is a collection of codons lined up in a specific order. That order translates into a chain of amino acids, which forms the basis for a specific protein molecule used by the organism. The four letters of the genetic code can produce a total of 64 possible codons (4 × 4 × 4, or 4³). However, organisms use only 23 amino acids for protein synthesis, and so most of the amino acids are represented in the code by more than one codon.
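The count of 64 can be confirmed by listing every three-letter combination of the four bases. The short sketch below uses Python's standard `itertools.product`; the name `ALL_CODONS` is just a label chosen for this example.

```python
from itertools import product

BASES = "ACGT"

# Every ordered choice of three bases, with repeats allowed.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(ALL_CODONS))  # 64
print(ALL_CODONS[:5])   # ['AAA', 'AAC', 'AAG', 'AAT', 'ACA']
```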
However, there are also codons that act as markers to show where one gene starts and ends. ATG is the so-called initiation, or start, codon. Each codon after that is translated into an amino acid (including ATG, which subsequently represents the amino acid methionine). The process continues until the gene presents one of three possible stop codons, which halt the translation process.
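Reading a gene from its start codon to a stop codon can be sketched as below. Several things here are assumptions rather than part of the text: the table is the standard genetic code, which covers only the 20 most common amino acids (fewer than the count given above), the one-letter amino acid abbreviations are the conventional ones with `*` marking a stop, and the function name `translate` and the example sequence are invented.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
# One-letter codes stand for amino acids; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first ATG up to, but not including, a stop codon."""
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon halts translation
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical sequence: two stray bases, then ATG GCA CCG TGA, then two more.
print(translate("CCATGGCACCGTGATT"))  # MAP
```

Here M is methionine, produced by the start codon itself, and the three stop codons in this table are TAA, TAG and TGA.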
It is frequently said that amino acids are the building blocks of proteins. That is true enough, but in fact only 23 out of the 500-plus known amino acids are used for this purpose (while a handful of others have metabolic roles not linked to proteins). Amino acids are simply a class of organic (carbon-based) chemical compounds. All of them contain a carboxylic acid grouping – the same one that gives vinegar and lemon juice their acidity – plus an amine group, which contains nitrogen.
This nitrogen content is the crucial factor. Nitrogen is very common in the biosphere – nearly 80 per cent of the air is made from it. However, animals are unable to access free nitrogen directly. Instead, they must consume the bodies of other organisms to get the amines they need to build protein. Plants can construct amino acids from nitrate compounds in the soil (which is why farmers top these up using fertilizers). However, even they must rely on bacteria to ‘fix’ nitrogen gas from the atmosphere and make it available to them in nitrate form.
All amino acids have a common structure, with an NH2 amine group and a COOH carboxylic acid group bonded to a molecule of carbons, R, which can be of any size and shape.
More than 100,000 different proteins are used in the human body, all constructed from long chains of amino acids. A single chain of amino acids is called a polypeptide. One cistron (DNA unit or ‘gene’) carries the code for one polypeptide, and a protein molecule contains at least one of these – many have two or three, each coded for on separate cistrons.
A protein can have anywhere from 400 to 27,000 amino acids inside it. The number of different permutations of acids in molecules of this size is effectively limitless, but the precise ordering encoded in the genes gives every protein a unique shape, which in turn gives it a specific metabolic role. The order of amino acids is a protein’s primary structure, while bonds between amino acids at different points on the chain create a twisted secondary structure, which then folds in on itself to make a complex tertiary structure. Even a single change in a protein’s primary structure can alter its secondary and tertiary structures, sometimes creating a very different molecule.
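To get a feel for "effectively limitless", the number of possible chains can be counted directly: with a fixed set of amino acids to choose from at every position, a chain of a given length has that many choices multiplied together. The sketch below assumes 23 amino acids and an illustrative chain length of 400; the function name `possible_chains` is invented.

```python
def possible_chains(length: int, amino_acids: int = 23) -> int:
    """Number of distinct amino acid orderings for a chain of this length."""
    return amino_acids ** length


# Python integers have no size limit, so the exact value can be computed.
count = possible_chains(400)
print(len(str(count)), "decimal digits")
```

Even for this modest chain the count runs to hundreds of digits, far beyond the 100,000 or so proteins the body actually uses.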
The structure of a protein must be understood on at least three levels: together these structures give each protein its unique shape.
Proteins are commonly seen as synonymous with meat, and particularly muscle. But while it’s true that the contraction of muscles is brought about by paired protein molecules pulling against each other, protein has a structural role throughout the animal body. Collagen is the primary example – it forms the foundation layer of skin and other connective tissues that hold the body together.
However, most proteins are used as enzymes. These are the biological equivalent of chemical catalysts, and facilitate the many chemical reactions that are required to keep an organism alive. Enzymes are intimately involved in these reactions, but are not used up by them. Each enzyme has a specific set of abilities, based on its shape, and they are at work both inside the cell and beyond it. For example, the replication of DNA is managed in the nucleus by an enzyme called DNA polymerase. By contrast, amylase is a digestive enzyme, secreted into the mouth and small intestine: its role is to break starchy foods into simpler sugars.
An enzyme’s function arises from its shape. The folded molecule is able to bring other substances together to react in a way that would not otherwise happen.
An enzyme acts on specific target molecules known as the substrate. This may be one molecule that is split apart by the enzyme’s action, or two or more molecules that are joined together. The enzyme is able to make substrates react in ways that would not occur otherwise, because the energy required for the reaction to start spontaneously is too great. An enzyme is able to manipulate the substrate in such a way that this energy barrier is lowered. The best guess at how it does that is the ‘lock and key theory’.
In this model, the enzyme is a ‘lock’, meaning that it has an active site – a location that contacts the substrate – with a specific shape. The substrate is the ‘key’ that fits into the active site exactly. When fitted together, the enzyme is able to weaken particular bonds in the substrate, pulling some areas apart and bringing others together to bring about the reaction. Often, attendant molecules called coenzymes, which are frequently derived from vitamins, are required for full enzyme function.