2

The DNA Dinner

Yet som there be that by due steps aspire
To lay their just hands on that Golden Key
That ope’s the Palace of Eternity

JOHN MILTON

I contemplated the two half-full shot glasses sitting in front of me. They were nicely presented on a small wooden board, sitting just beyond my dessert spoon. Everyone at the table had the same arrangement. My pre-dinner drink was still half full, and I didn’t particularly want to go straight to shots — but, on the other hand, there was a mood of celebration in the air, and it was a little tempting.

It’s just as well that neither I, nor anybody else in the large, crowded hotel ballroom, succumbed to that temptation. That glittering occasion — the 50th anniversary of the discovery of the double helix structure of DNA — may not have been quite the same had a dinner guest been poisoned.

*

The DNA Dinner was a highlight of the 19th International Congress of Genetics, held in Melbourne, Australia, in 2003. Its organisers were people who were open to ideas, and not afraid of consequences. This made for a night that was thoroughly memorable, if a little hazardous. Huge helices of balloons coiled up the walls, as you might expect. Not so predictably, whoever was responsible for them had apparently decided that the double helix looked a bit thin, so the balloon decorations paid tribute to the triple helix instead.[1] Francis Collins, leader of the Human Genome Project, played the guitar and sang ‘Happy Birthday to You’ … to the human genome.

[1 The triple helix is important, too — collagen, one of the major types of protein that make up your body, has a triple helix structure. But this wasn’t the Collagen Dinner.]

And then there was the poison.

To be fair, apart from the potential for harm, it was a brilliant idea. One of the glasses contained a nearly complete extraction of DNA from a plant. The other contained the last chemical needed to complete the extraction. At the beginning of the night (probably just in the nick of time), the MC told us to pour the contents of one glass into the other — and before our eyes, DNA precipitated out. There it was, in front of us — the magical stuff that had brought us all together.

You can do your own DNA extraction at home, using household ingredients (only one of which is poisonous). I’m told that cookbooks sell well — so here’s a recipe.

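Ingredients

  Strawberries
  1 teaspoon of salt
  Half a cup of warm water
  2 teaspoons of dishwashing liquid
  Rubbing alcohol (about as much as the strained strawberry liquid)
  Equipment: a ziplock sandwich bag, a coffee filter, a glass, and a wooden skewer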

Preparation

  1. Add one teaspoon of salt to half a cup of warm water, and stir until dissolved.
  2. Add two teaspoons of dishwashing liquid to the salty water. Stir gently (you don’t want it to froth).
  3. Put the strawberries in a ziplock sandwich bag and seal. Mash the strawberries inside the bag thoroughly by hand.
  4. Pour the salty, soapy water into the bag with the strawberries.
  5. Mix well. Be gentle, avoiding frothing of the mixture.
  6. Use a coffee filter to strain the mixture from the bag into a glass. Make sure you get a decent amount of liquid in the glass. It’s best not to use a tall, narrow glass for this step because it will make life difficult at the end.
  7. Pour rubbing alcohol gently down the side of the glass, in about a 1:1 ratio with the liquid from the strawberry. The alcohol will form a layer on the top.
  8. Now it’s poisonous. Do not drink.
  9. Allow to stand, and wait. DNA will rise into the alcohol layer, in a gloopy white mass.

At this point, if you like, you can take a wooden skewer and lift the DNA out of the glass. It has some interesting properties. Dab it against the side of the glass a few times — the clump on the end of your skewer will get smaller. Lift it carefully enough out of the liquid and you may be able to get a long, thin strand to rise up from the surface. DNA is sticky, and can be loosely or tightly coiled. When it shrinks, that’s loose DNA coiling more tightly as it sticks to itself. The string of DNA you can pull up from the surface of the liquid is a series of individual strands, sticking to each other as they are lifted out of the container.

I strongly urge you to give this a go. It is deeply satisfying to know that you are holding the stuff of life in your hands. Even if it does look and feel rather like snot.[2]

[2 No, I don’t know what it tastes like.]

*

Francis Collins was the focus of attention that night at the DNA Dinner, not because of his guitar playing and singing (fine though they were), but because of his leadership of the Human Genome Project. Three years previously, on 26 June 2000, there had been a ceremony at the White House, announcing that the human genome had been sequenced. President Bill Clinton hosted the event, and British prime minister Tony Blair joined by satellite. Collins and the publicly funded Human Genome Project shared the limelight that day with a private company, Celera. A remarkable effort, led by Celera’s CEO, Craig Venter, had turned the sequencing of the human genome into a race, ending in an effective dead heat between the public and private efforts.

Some might say that the ceremony jumped the gun a bit, because there were still so many gaps in the sequence — no fewer than 150,000 of them — and because at least 10 per cent of the sequence was still missing. In fact, there was another announcement on 14 April 2003 that the project was really finished, but, even then, there were still gaps. By 2004, things had improved considerably, yet there were still 341 gaps … and even today, the job is not quite finished.

Nonetheless, at the time of the announcement in 2000, there was a good working draft — and to be fair, that was what was announced: completion of a draft sequence. Most researchers, most of the time, could consult the data with the expectation that there would be detailed information available about the region they were interested in. It was an exciting time, but it still wasn’t entirely clear to those of us in the clinical world exactly what we were going to use the genome sequence for.

One day in late 2001, a package arrived in our department that was to thoroughly prove this point. It was the human genome on disk — sent to us, free of charge, by Celera. Excitedly, we opened it up, loaded it onto a computer, and started exploring the contents. And quickly stopped. We had no idea how to interpret the information we had been sent, and no way to connect it with our patients. As it turned out, it would be more than a decade before genomic data would become a routine part of practice in clinical and diagnostic-laboratory genetics. Nowadays, I consult the genome browser run by the University of California, Santa Cruz numerous times during the course of a working week. I could not do my job without it.

So, what’s in the genome? What exactly am I browsing, thanks to UCSC?[3]

[3 I’m afraid you might find my browser history a little dull.]

That white gloopy stuff you extracted from the strawberries is made up of four different chemical building blocks: adenine, cytosine, guanine, and thymine. Their initials are A, C, G, and T, and they are called nucleobases, or bases. The human genome is composed of about three billion bases altogether. Most of the time, they exist as base pairs, because of the double helix in which DNA usually exists. That double helix is made of two individual strands, which complement each other. A on one strand sticks to T on the other, and C on one strand sticks to G on the other, so that the double helix looks like this:

G A T T A C A
| | | | | | |
C T A A T G T

The two strands run in opposite directions — there is a direction to DNA, related to the way it is copied and translated to make proteins. So the strand that complements ‘GATTACA’ would be read as ‘TGTAATC’ by the cell’s machinery, not ‘CTAATGT’.
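The pairing and direction rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it handles nothing beyond the four upper-case letters A, C, G, and T.

```python
# Pairing rules from the text: A sticks to T, and C sticks to G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand as read in its own direction (i.e. reversed)."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

The first result is the lower strand as drawn in the diagram; the second is the same strand read the way the cell's machinery would read it.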

Three billion bases of DNA is an awful lot. To put that in perspective, here’s a chunk of human genetic code:

GAGGGGTGACAATAGGAATATTTGCTTTTCATCCCTCATGATCAT
CACCCTTCCCTTCTTCCACTCTTACATTTTTATTTCCCAAAGT
GGGCTGCAGTACTGGTGGTGAAAAGCAGTCATTCTAGAACT
CACTAGCTTGGTCAGGAAACACTGGGCATGCTACTCCGTGGGAATC
CCAGGTAATTTTATACAGTGATGATGGGTCATCCCTACAGCCTG
CCTGGAGTTTCTGAGCTTAACTTTTCCTGAGTCAAATCCAAGTGAC
CATTTGGTTATGCTGTTCTTTCCAGATCTTTCCTTGTTGAACAGT
GTCAGCTGTCATTCACAATTCTCTTTACCTCCAGAGCAATTTGT
GGAGAAGTCGTCCTGTGCCCAGCCCCTGGGTGAGCTGACCAG
CCTGGATGCTCATGGGGAGTTTGGTGGAGGCAGTGGCAGCAGC
CCGTCCTCCTCCTCTCTGTGCACTGAGCCACTGATCCCCACCACCCCCA
TCATCCCCAGTGAGGAAATGGCCAAAATTGCCTGCAGCCTGGAGA
CCAAGGAGCTTTGGGACAAATTCCATGAGCTGGGCACCGAGATGATCAT
CACCAAGTCGGGCAG
GTAGTTGGGACTTGGGGGTTGGGGGTTGGGG
GTGGAGAGTCAGGACACTCCCTGGGTAGTTGAGGGTGCTTCCAG
GAACTAGATGAGAGCTGGCTGGTCATGGAGCGGAGAGACAGCTTG
GCTCCAGGGCAGCTGCTTTCCACCAGCTTGCATTAGGAGCTACAG
GATGTCTAGTCATTTGCGTTCTCAGGATTTGGTCATGGGAAG
CCCCACCCTGGCTTTGTTGAGAAGGGCACAGGGACCAGGGAG
ACACACTAACCCCGAAGGGTGTGGTCTGCTTTCCCTGGAGCTG
GAGAAGGTTTGGCGGGTGGAGGGTCGGGATCTGGAAGGAGGAG
GAATTTGTGCCTGGGTGCCTGGTGAGCTGCTGGGTGCTTCTAGG
TAGGTGAGTAGCTTCCCTTTTATCAGCCTCAATTTGCAAAAGCTG
CCAGCTCCCATTAAAAACTAAAATTAAAACCTGGGCGGAAGAAT
GAAATTTGAAACGATAAAATTCCCTGTAGGAAGGAGCACTGCTC
GGGGCCTCTTGGCGCCAGAGCCGGGCGGGCTTTGGCCAGGCAG
GAAGCTGCAGGGCTGCAGGGAGGTTGGGATGGGGCAGAGGCTGG
CAAAACTTGGTGGCTCTAGCTCTTGGGACTACAGAAAATACCT
GCAGGGCATCTGAGAAATCCTTCCCAGAAACCTCTGCTTTTG
GCTTTTATTTTGCAAGAGCAGAGTTTTCTGGCTGGGATGCGGGT
GAGTTGTGTGACTGGGTCAGCTCCAGGGACTTCGGGTCCTGGGA
CACTTAATGTGCTTGATCGTTAAAATGCATGGGATTTTCCCTA
ATCACAGACCTTCTGGAGTTAACACATACCCCCACCCCCAC
CCCCACCTTTTCACCTAGCAATTAACACCTGCTTAAAGGTGA
CACTTAAAATTATCTAGGCTTGGAAGAAAACCCTGTCTCTGTAT
TCACTTCTCTGAGGCTTTAAACAAAACAAAAGAGGGGTTTGTG
GACCGGATAGAGAGGGGAGTCAGACCCTTTCCCTCCTTCCCTC
CCTCCCTTCCTCTCTTCCTAATTCAGGTCAGTTTATTAGGCAG
CATAAACAGGGCCCATTCTCTCTCTCTCTCTCTGTCAGGAGGAT
GTTTCCAACCATCCGGGTGTCCTTTTCGGGGGTGGATCCTGAGGCCAAG
TACATAGTCCTGATGGACATCGTCCCTGTGGACAACAAGAGGTACCGC
TACGCCTACCACCGGTCCTCCTGGCTGGTGGCTGGCAAGGCCGACCCGC
CGTTGCCAGCCAG
GTTCGTGCCTCCAGATTTTTCACTGAGAAAACT
GTTAGTGCATCTGTCAGAATGTTTCTGGCTTGTGTGAATTTTAAG
CAAGTGTATTTTTAAAGCAGCGGGCTCTGGCAAGAGAGCATTC
CAAGCCTGGACACTCCAGGATTGACTACACAAAACATGGGCTAG
GCTCTGAGAAAGGTAGTTTGTGCATAGAGAAAACACTGTCTTTA
AGTTTATGTTTCGTTAGGCAGTAATTCATTTCAAAGTTTTTCTTA
AATTTCAATTTGAGTATTCATTAGAAATGTGGACCCATTTTGTATA
AATATAAATATAGACATCCTCTCTAATTGCTGCTTAAAACCAGAGTGAA

This is one of my favourite bits of the genome: it’s part of the gene TBX20, which had a starring role in my PhD. Printed at the same density on A4 paper (single sided), you’d need 781,250 sheets of paper for the whole of the human genome. If each sheet of paper is 0.1 mm thick, you would need a stack of paper just over 78 metres tall (256 feet), halfway in height between the Sydney Opera House and the Statue of Liberty. Without a key, of course, this would be just a stack of meaningless letters. With the key, that stack of paper contains untold scientific wealth.
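For anyone who wants to check the paper-stack arithmetic, here is a small sketch. It assumes a genome of exactly three billion bases (the text says "about"), and takes the sheet count and thickness from the paragraph above; the variable names are just for illustration.

```python
GENOME_BASES = 3_000_000_000  # assumption: "about three billion" taken as exact
SHEETS = 781_250              # sheet count quoted in the text
SHEETS_PER_MM = 10            # each sheet is 0.1 mm thick, so ten sheets per millimetre

# Bases that must fit on each sheet for the whole genome to fit in the stack.
bases_per_sheet = GENOME_BASES / SHEETS

# Stack height: sheets -> millimetres -> metres.
stack_height_m = SHEETS / SHEETS_PER_MM / 1000

print(bases_per_sheet)  # 3840.0
print(stack_height_m)   # 78.125
```

So the quoted sheet count corresponds to 3,840 letters per page, and the stack does come out just over 78 metres.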

What is the key? And what’s in the genome? It turns out that it’s more a matter of a set of keys, rather than just one. DNA tells many stories, if you can read them.

As we saw in the previous chapter, we have pairs of chromosomes[4]: pairs, because you get half of your genetic information from your mum, and half from your dad. In turn, you pass on half to each child you have. So: one copy of chromosome 1 from mum, one from dad, and so on. Chromosome 1 is the largest. It’s about a quarter of a billion bases long, and is home to more than 2,000 genes. The smallest, chromosome 21, is less than 50 million bases long and holds only a couple of hundred genes. The humble Y chromosome is just a little longer than chromosome 21, but has only about 50 genes.

[4 Feel free to go back and admire my chromosomes again at this point.]

There is also some DNA outside the nucleus of the cell: we have a second genome, a tiny one (only 16,569 bases and 37 genes). It lives in structures called mitochondria — more about them later.

Speaking of genes — you’ve doubtless heard of them, because they are the most famous things in the genome. As I explained, their job is to act as a blueprint that tells the cells how to make proteins, which in turn do the many complex tasks a cell needs to do in order to be alive and contribute something useful to your body. The parts of the genes that get translated to make proteins only account for around 1–2 per cent of the genome.

There’s still quite a bit of controversy about how much of the rest of the genome actually does anything. Some of the non-gene bits are definitely useful and important. For example, the centromere, at the waist of the chromosome, is essential for making sure the chromosome copies go where they are supposed to when cells divide. Mess that up and the consequences are not good. The ends of the chromosomes have structures called telomeres, which form a protective cap. You may be familiar with the Bernard Bresslaw song about feet:

You need feet to keep your socks on

And stop your legs from fraying at the ends

Chromosomes are not known for wearing socks — but like your legs, it’s bad news if they fray at the ends. As you get older, the telomeres themselves do tend to fray a bit, gradually getting shorter with each cell division. During the development of many cancers, they wind up a lot shorter than they should be, or disappear altogether, leaving the ends of the chromosomes exposed and vulnerable to damage. Paradoxically, what happens next is a restoration of the telomeres — as cells are undergoing transformation to malignancy, their chromosomes reacquire robust telomeres. This is part of what makes a cancer cell ‘immortal’.

Although the parts of genes that code for proteins only account for about 1–2 per cent of the genome, genes spread across about a quarter of the genome. The reason for that difference is that most genes are a mix of two types of sequence: introns and exons. The exons code for protein, i.e. their sequence specifies which amino acids to include in the protein, as well as when to start and stop. By contrast, the introns don’t code for anything, and, while they undoubtedly still do serve a purpose, we still only have a fairly limited understanding of what that is.[5] Introns can be truly enormous, many thousands of bases long. Sometimes, they are so big that there’s room for an entire gene to sit inside an intron of another gene, usually running in the opposite direction, on the other strand of DNA. The double helix is a two-way street.

[5 One well-understood function of introns is to enable the use of the same gene to make different versions of its protein, sometimes versions that have quite varied functions. This is done by ‘alternative splicing’: some exons don’t always get used, so there is sequence that could be either exon or intron. Plenty of genes don’t do this at all, but there are some proteins that come in numerous flavours depending on the way that splicing happens. Another function of introns is to help control when and where a gene is switched on, a regulatory function.]

You can see exons and introns in that chunk of TBX20. The bold sections are exons; the bits in between are introns. You can even see some of the genome’s operating instructions, written right there in the DNA sequence. At the start of each intron are the bases GT; at the end of each intron are the bases AG. Together, GT and AG form a key part of a message to the cell’s machinery that says, ‘There’s an intron here. Not needed for protein — please cut out.’[6]

[6 This is something of a simplification. The GT and AG are key parts of the signal that says, ‘I’m a splice site,’ but the bases that surround them are important, too. There’s a more detailed description of the relationship between gene and protein in the notes section, if you’d like to read more about this.]
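The ‘cut out the introns’ instruction can be sketched as a toy program. Everything here is an assumption made for illustration: the gene is invented (it is not TBX20), the function names are hypothetical, and the check looks only at the GT and AG ends, which, as the footnote says, is a simplification of the real splice signal.

```python
def looks_like_intron(seq: str) -> bool:
    """Check the simplified signal: starts with GT and ends with AG.

    Real splice sites also depend on the surrounding bases, so this is a
    necessary-looking check for the toy example, not a real intron detector.
    """
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def splice(segments):
    """Join the exons of a gene given as (kind, sequence) pairs, dropping introns."""
    for kind, seq in segments:
        if kind == "intron" and not looks_like_intron(seq):
            raise ValueError(f"intron does not run from GT to AG: {seq}")
    return "".join(seq for kind, seq in segments if kind == "exon")


# Invented toy gene: exon, intron, exon.
toy_gene = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTCTCTCAG"),
    ("exon", "GAGTAA"),
]

print(splice(toy_gene))  # ATGGCCGAGTAA
```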

What percentage of the human genome actually does something? Well, we still don’t know. In September 2012, the results of a major follow-up to the Human Genome Project, called the ENCODE project, were released in 30 scientific papers, all in one hit. Getting that many scientists to cooperate so that 30 papers were released into the wild simultaneously was perhaps as impressive an achievement as the actual science in the papers. The ENCODE Consortium reckoned they had found a function for 80 per cent of the genome. Most of it was claimed to be busy controlling the function of other bits — a rather bureaucratic vision of cell biology. There was a lot of criticism of this announcement at the time, and the debate rolls on. Recently, a paper was published that argued that only 8 per cent of the genome is functional. That’s quite a gap. I have no idea what the right answer is, but I doubt it’s as low as 8 per cent or as high as 80 per cent.

There’s an awful lot of the genome which looks like genetic wreckage — genes and other elements that have lost their function over the course of evolution. For instance, there are loads of smell receptor genes that are broken and don’t do anything — earlier in evolution, our ancestors needed a keen sense of smell to survive, but for a long time we’ve been able to get by just fine with a comparatively poor sense of smell. So when those genes acquired mutations, it didn’t cause a problem, and the broken form was just passed on to future generations. You inherited hundreds of broken genes from your parents, and in turn you’ll pass them on — or have already — faithfully copied, and still not doing anything.

There are also lots of repetitive sequences that don’t seem to be up to much. Sometimes, viruses copy themselves into a host’s DNA, and there’s quite a lot of what look like old virus sequences scattered about the place. There are chunks of DNA that have been copied in what are called duplication events. If you have spare copies of something, it doesn’t matter much if one of them loses function, so you often wind up with two versions of a gene: one that works and one that doesn’t (a pseudogene). And there are bits of DNA that seem to be there simply as a side effect of DNA’s drive to be copied: long, long strings of sequence that look like nothing much (ATATATATATATATATAT …).

On the whole, this doesn’t seem to cause us any bother. There’s no apparent pressure on the human genome to get more efficient, or maybe whatever pressure exists is outweighed by DNA’s tendency to copy itself, and by the various mechanisms which introduce new bits of DNA into the sequence. There are plenty of other organisms that get by fine with much bigger genomes than ours, with proportionally even more freeloading DNA. There’s an amoeba, Polychaos dubium, that reportedly has a genome more than 200 times the size of ours. The humble onion’s genome is five times bigger than ours, and, on the whole, you are more likely to eat an onion (or extract its DNA) than the other way around. On the other hand, the pufferfish, fugu, has a genome only about an eighth the size of ours … and pufferfish are quite a bit more complex than onions.

It does seem like there might be a price to pay for an oversized genome, at least when times are hard. There’s a plant called teosinte, which is thought to be the ancestor of maize. In 2017, a paper was published that compared the size of the genomes of different species of teosinte, living at different altitudes. Many plants have enormous genomes — but in teosinte, at least, the higher the altitude, the smaller the genome. If you’re living in a tough neighbourhood, high on a mountain, you can’t afford to waste energy copying DNA if it isn’t doing a valuable job for you.

It’s just about possible that the human genome happens to be exactly the perfect size, so that everything in it has an important role. But that would seem like quite a fluke. More likely, the human genome does indeed carry around its fair share of true ‘junk’ DNA.

That’s not to say there isn’t anything impressive and interesting about the human genome. When I started working in genetics, we used to confidently tell people that the human genome contained about 100,000 genes — because we’re such important, special creatures, there had to be a lot, right? Then that estimate started going down … and down … and down. By the time the Human Genome Project was completed, the number had gone down to a bit over 20,000 genes. Part of the explanation for this is that our genes have quite complex structures, and an awful lot of them do more than one job. Sometimes, that means doing a similar task in slightly different ways, like a muscle protein that forms a bit differently depending on whether it is working in heart muscle or in regular muscle. Sometimes, though, it means the same protein can be used to do wildly different jobs. This is called ‘moonlighting’. For instance, there’s an enzyme — a protein that makes chemical reactions happen — that also plays an important role in making the lens of the eye transparent.

But you can say similar things about the genomes of a lot of organisms, and all of it about the chimp genome. Chimpanzees, especially the bonobo (or pygmy chimpanzee), are so close to us genetically that a Martian would probably view us as just different varieties of the same animal. We are closer to chimps than African elephants are to Asian elephants, so you could hardly blame our extraterrestrial visitor for being confused.

How do we know all this stuff? It comes back to the Human Genome Project.

At the time of its conception, the HGP was an enormously ambitious idea. Only a small fraction of the genome had been sequenced. Mostly, what we had was a kind of outline sketch — a map, in fact. You often hear the expression ‘mapping the genome’, and indeed that was the first step. But we never map a person’s genome now, because the job has been done already — just as you don’t have to map someone’s whole neighbourhood in order to find their house. A genetic map is not like a street map, because it only really has one dimension, not two — it’s all about what lies where along the string of DNA that makes up a chromosome. To make such a map, you need a series of markers — genetic signposts — with a known relationship to each other. Those signposts consist of bits of DNA with some way of uniquely identifying them. Say we have three such markers, A, B, and C. If we make a genetic map that includes A, B, and C, it would consist — at the least — of the information that they sit along the chromosome in that order — A-B-C — rather than A-C-B or any of the other possibilities. A better version would say that A, B, and C are all on chromosome 1, and not on any of the other chromosomes. And the most useful type of map would also tell us how far apart they are.
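The three levels of map described above (order, then distance) can be sketched for markers A, B, and C. The pairwise distances below are invented purely for illustration, in arbitrary map units, and the function names are hypothetical. The only idea used is that, on a one-dimensional map, the two markers that are furthest apart must be the two ends.

```python
def order_three(dist):
    """Work out the order of three markers from their pairwise distances.

    dist maps a frozenset of two marker names to the distance between them.
    The furthest-apart pair sits at the ends; the remaining marker is in the middle.
    """
    outer = max(dist, key=dist.get)
    (middle,) = set().union(*dist) - outer
    first, last = sorted(outer)
    return [first, middle, last]


def positions(dist):
    """Place the markers along a line, measuring from the first marker."""
    first, middle, last = order_three(dist)
    return {
        first: 0,
        middle: dist[frozenset({first, middle})],
        last: dist[frozenset({first, last})],
    }


# Hypothetical distances, not real data.
hypothetical = {
    frozenset("AB"): 5,
    frozenset("BC"): 12,
    frozenset("AC"): 17,
}

print(order_three(hypothetical))  # ['A', 'B', 'C']
print(positions(hypothetical))    # {'A': 0, 'B': 5, 'C': 17}
```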

The earliest such maps were made in the early 20th century, for fruit flies. By 1922, genes for 50 different characteristics had been mapped to the four fly chromosomes. These were all physical differences in the fly that a researcher could directly observe. Flies would be examined for multiple different characteristics and mated with flies that had also been carefully examined, and the resulting offspring would be examined in turn. It was exacting, difficult work, but taught us a lot of fundamental information about genetics, and gave us tools that were used all the way through the 20th century, and were essential to the success of the Human Genome Project.

For the fruit fly’s X chromosome, for example, there was an early map that looked like this:

y..........w.......................................................................v....m

In this map, y stood for yellow body, w for white eyes, v for vermilion eyes, and m for miniature wings. The map means that yellow body and white eyes are closely linked — they are more likely to be inherited together — whereas the miniature wings variant is more likely to be inherited along with vermilion eyes than with white eyes. This particular map was the work of Alfred Sturtevant, another nearly forgotten genius, and was put together in 1913, when Sturtevant was only 21. At the time, he was working under the supervision of the great geneticist Thomas Hunt Morgan. Sturtevant seems to have been a child prodigy: by age 21, he already had a long track record of studying inheritance. Morgan had been impressed by Sturtevant when, still in his teens, he wrote a paper about how horses inherit coat colours — based on observations made as a child, on his father’s farm! The paper was published in a scientific journal, Morgan offered Sturtevant a position in his lab — and the rest is history.

That’s some school project.

Sturtevant went on to have a long and distinguished career in science, pausing only to marry Phoebe Curtis Reed, a technician who also worked in the fly lab. They had three children, who must have grown up with quite unusual ideas about what constitutes a normal topic for dinner table conversation.

From the point of view of genetic mappers, most of the 20th century was a hard slog. From mapping physical characteristics that you could see, things moved on to biochemical and other laboratory markers — in yeast, and eventually in humans. It wasn’t until 1987, 17 years after Sturtevant died, that the first genetic map spanning the whole human genome was published — 407 variable bits of DNA spread across the 23 chromosomes. If the Human Genome Project was the moon landing of genetics, Sturtevant’s early maps were our Wright brothers’ flight.

So, by the late 1980s, we had that outline map, like those early explorers’ maps of the world. We knew the outlines of our 23 land masses, and there were definite signposts spread along them (those 407 genetic markers), but, apart from a few important ports — chunks of known DNA sequence centred around disease genes — there was precious little detail to be found on our maps.

Going from a 407-marker sketch to a richly detailed map, and then to a completed genome sequence, needed some special tools. One of the most important of those was Sanger sequencing.

You’ve surely heard of Marie Skłodowska Curie, who won two Nobel prizes, in physics and chemistry. There’s a good chance you’ve heard of Linus Pauling (Nobels for chemistry and peace). I have to confess I had never heard of John Bardeen until I started writing this chapter. This is embarrassing, since it seems we all owe him an enormous debt for the work that won him two physics Nobels. Bardeen was co-inventor of the transistor, and developer of the theory of superconductivity — your phone only works because of Bardeen’s discoveries. The same goes for the computer on which I am typing this.

But for our purposes, there can be no doubt that Fred Sanger was the greatest of the quartet of double Nobel winners. Sanger, an Englishman, was a chemist who worked out how to determine the sequence of amino acids that make up a protein, for which he received his first Nobel. Sanger was a Quaker who received official conscientious-objector status during World War II, and a good thing, too — it would have been an appalling loss to the world if he had died during the war.

The protein Sanger chose to study first was insulin — the sugar-regulating hormone that is lacking (or ineffective) in people with diabetes. Insulin had been successfully used to treat diabetes since the early 1920s, and was one of very few proteins available in pure form in the early 1950s. The story of how that came to be is a special part of science history in itself.

Media stories of medical breakthroughs tend to be evenly divided between breathless reports about small improvements that happened several years ago and research in animals that might never translate into something relevant to humans. My PhD supervisor, Richard Harvey, and I were once interviewed on national TV news about research we hadn’t even done yet. We’d been awarded a grant from the National Institutes of Health in the US, and our institute’s public-affairs department had somehow sold this as a big news story to one of the TV networks. Years later, when we had done the research and had the results published, we couldn’t even get a local newspaper to give us a mention.

But there are some genuine stories of miracle cures in the history of medicine. The introduction of penicillin is one: deadly, incurable infectious diseases suddenly became curable. But for a real wonder drug, nothing beats the story of insulin.[7]

[7 Okay, there is one that’s better. Anaesthesia beats insulin. And I am definitely not saying that because my wife is an anaesthetist.]

Diabetes comes in two main flavours — and I use the term advisedly. Diabetes mellitus, by far the most common type, gets its name from a Latin word meaning ‘sweetened with honey’ — because the urine of sufferers tastes sweet. If you were to taste the urine of someone with diabetes insipidus, by comparison, you’d find it insipid: flavourless. The urine sommelier at your favourite restaurant would certainly not recommend it.

In turn, diabetes mellitus is divided into two broad groups, based on how well treatment with insulin works. If you have a deficiency of insulin, because your pancreas has stopped making it, then what you need is a replacement, which you can get in the form of an injection. This is the type we’ll be focusing on here. On the other hand, if your body makes insulin fine, but no longer responds normally to its effects, you have non-insulin-dependent diabetes, quite a different problem. As you might expect, there are all sorts of subtypes beyond this broad division. One rare type that affects newborn babies will make a cameo appearance later in the book.

The main job of insulin in the body is to give your cells the go-ahead to take up and use glucose from the bloodstream. If there’s no insulin around, it’s as if your cells are sugar-blind — they just can’t tell that the glucose is there, and they can’t do anything with it. Since the sugar isn’t being used, it builds up in the blood and spills into the urine, making it sweet — and dragging water with it so that you pee too much and get dehydrated. Meantime, your body is starving in the midst of plenty, because your cells can’t use the glucose that’s there.

By the early 20th century, it was already known that if you removed a dog’s pancreas, the animal would develop diabetes and would die within a fortnight. Unfortunately, much the same was true for people — diabetes, which mainly came on during childhood, was a death sentence. Affected people might linger on for a few weeks or months, but eventually they would slip into a coma and die.

There are several heroes in this story, all Canadians. Frederick Banting was a surgeon who had an idea about how to get an extract from the pancreas of a dog that might be used to treat diabetes. People had tried to do this previously, but nobody could get it to work.

As well as making insulin, the pancreas makes digestive enzymes, and Banting thought that, when people mashed up pancreatic tissue to try to extract insulin, those enzymes were coming into contact with the insulin in the mash and digesting it, so that there was none left to be extracted. His idea was to tie off the ducts that carry the digestive juices from the pancreas to the gut, causing the cells in the pancreas that made those enzymes to wither away. He hoped that, when this happened, you could mash up the remaining tissue, which would mainly be the insulin-producing cells without the enzymes, making it possible to get a pure preparation of the stuff they were after. He went to a leading diabetes researcher at the University of Toronto, John Macleod, who took some persuading but eventually gave him the resources he needed to give the idea a go, including ten dogs and an assistant, a medical student called Charles Best. Interestingly, it’s still true that, in the medical hierarchy, medical students rank slightly above domestic animals.

The story goes that Best was the winner of the luckiest coin toss in medical history. There were originally two students who could have been assigned to the project. Best and the other student, his friend Clark Noble, tossed a coin to see who would work with Banting, and Best won. Initially, they were going to swap halfway through the summer, but by then Best was entrenched in the work (and technically skilled at doing it), and they agreed that he should stay on. A share of scientific glory, on the toss of a coin.

It turned out that Banting’s idea was right. The initial work in dogs was so encouraging that, by January 1922, it was possible to start trials in humans. Another figure enters the story here: James Bertram Collip was the one who worked out how to purify the pancreas extract so that it could be safely injected into people. He was called in after the first recipient suffered from a severe allergic reaction when given the imperfectly purified extract of dog pancreas. There are stories from this time of extraordinarily dramatic recoveries. In particular, it’s said that once they had a pure preparation, Banting’s team went from bed to bed in a room full of comatose, dying children, giving injections. By the time they reached the last one, the first had already awoken from his coma.

If this story is true, you certainly can’t tell from the first scientific report of treatment with insulin, published in the Canadian Medical Association Journal. This paper is breathtakingly, gloriously dull. It’s not until halfway through the second page (in a paper only a little over five pages long) that there is even a mention of treatment in humans. The conclusions are cautious and measured: in essence, ‘we can measure some differences in the blood of patients, and they seem to feel better’.

Even if Banting and co. were reluctant to blow their own trumpets, word of such a discovery was bound to get out, and the news raced round the world. The very next year, Banting and Macleod were awarded the Nobel Prize in Physiology or Medicine. Banting shared his prize money with Best; Macleod shared his with Collip. We can only imagine what it must have been like for families of newly diagnosed diabetics, after the news was out, but before large-scale insulin production was possible. There must have been many people who died despite the existence of a treatment, and others who were saved in the nick of time.

Be that as it may, by the time Fred Sanger needed a purified protein to study, 30 years later, all he had to do was stroll down to the local pharmacy and buy a bottle. Sanger’s approach was to simplify the problem — instead of trying to read the sequence of the entire protein, he would smash it into shorter bits that were easier to handle. He developed chemical methods to work out the sequence of amino acids in those short sections, then pieced together those overlapping short sequences to work out the overall sequence of the protein. It was an idea that had a long scientific reach. As we shall see, Craig Venter’s company, Celera, would use essentially the same approach to sequence the human genome, nearly 50 years later, and it remains an important technique in genetics. In 2018, the koala genome was sequenced this way.

Sanger’s discovery was not just ‘this is the sequence of insulin’. Important though that would have been, its impact was trivial compared with the discovery that proteins have a set sequence, on which their structure and function depends. Many later discoveries, including our understanding of the way that DNA codes for proteins, would have been impossible without this basic understanding of what a protein actually is — a chain of amino acid molecules.

For his next trick (and next Nobel Prize), Sanger figured out a way to read the sequence of DNA. His method, published in 1977, is still known as Sanger sequencing, and was the foundation stone on which the Human Genome Project rested. At first, Sanger sequencing involved a certain amount of messing around with radioactive isotopes, but later improvements involved tagging the DNA bases with different coloured fluorescent markers — safer, and also possible to massively scale up. And scaled up it was.

We still use Sanger sequencing in diagnostic labs. Here’s an example of what it looks like:

In this picture, the top section is from a carrier of a genetic condition and the bottom section is from someone who isn’t a carrier. You can’t tell from a black and white image, but the convention is that A is green, C is blue, G is black, and T is red. Each of the different coloured peaks represents one base of DNA, so you can use those colours to read the sequence from the peaks … G, G, T, A, C, T, and so on. Or you could just read the sequence of letters helpfully placed above the peaks, I suppose — but you see how it works.

In the middle of this stretch of DNA, between the two vertical lines, the normal sequence has a C; at the same place in the sequence above, if the image were in colour, you would be able to see that there’s a red peak there, a T — but if you looked very closely, you would see that the normal blue peak, the C, is there as well. This means that one of the two copies of the gene has the usual sequence, and the other has a change. Both copies, the one with and the one without the change, are in the sample and undergo the sequencing reaction. For the bases that aren’t changed, there’s no difference between the two, and the resulting trace looks the same as the normal sample. For the altered base, half of the DNA in the tube has a C and half has a T, and the two overlap — that’s why the peak at that place is about half the height of the C in the normal sample.
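As a rough illustration of that logic, here is a minimal Python sketch of how software might turn the peak heights at a single position into a call. The function name `call_position`, the peak heights, and the 30 per cent cut-off are all invented for illustration; real base-calling software is considerably more sophisticated.

```python
def call_position(peaks, min_fraction=0.3):
    """Return the bases whose peak is at least min_fraction of the tallest peak.

    peaks maps each base (A, C, G, T) to a signal height at one position.
    The heights used below and the 0.3 cut-off are made-up values.
    """
    tallest = max(peaks.values())
    called = [base for base in "ACGT" if peaks[base] >= min_fraction * tallest]
    return "/".join(called)


# Hypothetical heights: a non-carrier has one tall C peak at this position...
normal = {"A": 10, "C": 1000, "G": 15, "T": 12}
# ...while a carrier has a C peak and a T peak, each about half as tall.
carrier = {"A": 10, "C": 510, "G": 15, "T": 490}

print(call_position(normal))   # C
print(call_position(carrier))  # C/T
```

Two bases reported at one position is the signature of a person with two different copies of the gene.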

That’s easy enough to do now because we already know the sequence of the gene that we’re interested in. But it’s also possible to use Sanger sequencing to discover a DNA sequence where it wasn’t known before. You start from a known section and work your way along, discovering new territory as you go — until you meet up with someone coming the other way. This was the approach used by the Human Genome Project.

So, by the late 1980s, we had the tools that would be used to complete the job. Actually completing it, however, seemed a very long way away. In 1987, the US Department of Energy, an organisation not known to be daunted by large-scale ventures, launched an early version of the project. Their agenda was to find a way to protect the genome from the harmful effects of radiation — potentially important information at a time when nuclear power plants were an important and growing part of the country’s energy supply. By 1988, the National Institutes of Health had joined the DOE, and together they had received funding from the US Congress to attempt the task.

It would be a scant 12 years before this early vision would become reality. Yet for the first half of that time, from the perspective of an outside observer … almost nothing happened. In 1990, a goal of 2005 was set for completion of the project, but, given the expectation of ‘over time, over budget’ for government projects, not many outsiders took this very seriously. By 1994, the main achievement of the HGP was … a denser map. The new map of the genome had not 407 but 5,840 markers, densely spaced across the genome. To the outside observer, not that impressive, perhaps, but it was a critical step on the road to the genome. And in a sign of what was to come, it was delivered a year earlier than expected.

From very early on, the HGP was an international effort, with scientists from all over the world contributing. An Australian, Grant Sutherland, was president of the Human Genome Organisation (although not leader of the Human Genome Project) for part of the project’s life. James Watson (yes, that Watson) was the first leader of the HGP. In 1992, he resigned, and, after a brief interim, Francis Collins took charge, and would see the project all the way through to its end.

The actual sequencing was done by 20 institutes spread across six countries — the US, the United Kingdom, Japan, France, Germany, and China. Each was assigned a section to work on. The biggest single contributor outside the US was the Sanger Centre at Cambridge, named of course for Fred Sanger. The Sanger Centre, now the Wellcome Sanger Institute, sequenced almost a third of the human genome. Their assigned turf — some shared with other institutes — included chromosomes 1, 6, 9, 10, 11, 13, 20, 22, and X. The project seems to have gathered speed like the proverbial runaway locomotive. It wasn’t until 1999 that the first complete sequence of a human chromosome (chromosome 22) was published. In September 1999, more than a decade after the project started, a press release trumpeted the completion of 821 million bases of DNA sequence. Half of that was still in ‘draft’ form, and more than two billion bases still needed to be sequenced before the project reached its goal. But by June the next year, the project was near enough done for President Clinton and Prime Minister Blair to claim success.

The HGP took a logical, safe approach to sequencing the genome, taking what were known as ‘walks’ from known to unknown places. But they had a competitor: J. Craig Venter, the Elon Musk of genomics, who proposed a different approach altogether. Venter wanted to use a shotgun to get the job done.

Venter was a brilliant scientist with an entrepreneurial streak. After a less than stellar school career, from which he emerged a better surfer than scholar, Venter was drafted, and he served in the United States Navy during the Vietnam war. Experiences in a field hospital had a profound effect on him, and, when he returned, he studied medicine, but later switched to research. He proved an outstanding scientist, but, while working for the National Institutes of Health, became involved in a controversy over efforts to patent genes, and later left the NIH to go to the private sector — where he also excelled. As the first president of Celera Corporation, he decided to race the Human Genome Project, using a method the HGP had rejected — shotgun sequencing. This involves smashing the genome up into many small pieces, sequencing those, and then assembling them like a giant jigsaw puzzle.

Let’s say you do some sequencing and find that you have three fragments like this:

GGTGTGAACTGCCCCGAGGG
CCGAGGGCAGAGACCTCCCGTTTTG
CGTTTTGTTCTCCAGCGCCTTGAGCCAGC

With the right computing oomph behind you, it’s possible to put these together, like this:

GGTGTGAACTGCCCCGAGGGCAGAGACCTCCCGTTTTGTTCTCCAGCGCCTTGAGCCAGC 16

[8 Another non-randomly chosen sequence: this is from NKX2-5, which also featured in my PhD and is one of my favourites for another reason. Animal geneticists in general, and fly geneticists in particular, have always been much better than human geneticists at naming genes. NKX2-5 is important in the development of the human heart. Flies don’t have much of a heart — just a tube that squeezes, really — but they have a gene that is very similar to NKX2-5. When that gene was found, it was discovered that flies that lack the gene don’t develop their tube-heart at all. The name of the fly gene? Tinman.]

The first chunk overlaps with the second, and the second with the third. Without the second, you’d have no way of connecting the first with the third. But keep on smashing and sequencing, smashing and sequencing, and eventually you’ll have enough overlapping bits that you can put the whole thing together. And in a genuinely astonishing achievement, Celera did exactly that. Pitted against 20 institutes in six countries and the US Department of Energy, Celera crossed the finish line in lockstep with the public HGP, which is why Venter shared the floor with Collins that day at the White House.
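For the curious, the jigsaw step can be sketched in a few lines of Python. This is a toy, not Celera's actual software: the function names `overlap` and `assemble`, the minimum overlap of five bases, and the greedy "join the best-matching pair first" strategy are all assumptions made for illustration. Real assemblers also have to cope with sequencing errors, repeated stretches of DNA, and fragments read from either strand.

```python
def overlap(left, right, min_len=5):
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_len=5):
    """Repeatedly join the two pieces with the longest overlap until none overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best = (0, None, None)  # (overlap length, index of left piece, index of right piece)
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining pieces overlap
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces


# The three fragments from the text, deliberately out of order.
fragments = [
    "CGTTTTGTTCTCCAGCGCCTTGAGCCAGC",
    "GGTGTGAACTGCCCCGAGGG",
    "CCGAGGGCAGAGACCTCCCGTTTTG",
]
print(assemble(fragments))
```

Given the three fragments above, this should return a single joined sequence. Take away the middle fragment and it returns two unconnected pieces, because nothing links the first to the third.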

Celera did have one important advantage over the HGP — access to all of the public body’s data. From the beginning, one of the fundamental principles of the HGP was open access to data, setting the tone for a standard that continues in biomedical science to this day.

You may wonder what Celera hoped to gain by this effort. The initial plan was to discover and patent genetic sequences. Celera did file preliminary patents on 6,500 gene sequences, but in the end did not follow through with the patent process, and made their data freely available as well (including sending us the disk that, through no fault of theirs, we couldn’t interpret).

Even from the beginning, the ‘reference’ human genome was a mishmash of the genomes of different people — as it should be. The HGP called for volunteers from people living near the 20 sequencing centres. No personally identifying information was kept, and only a small proportion of the samples collected were actually used for sequencing, so nobody knows whose DNA is the ‘reference’. The ‘reference’ genome we work with today has been updated and adjusted using extra information from many different people; it’s a quilt made from many patches of cloth of varying sizes. Celera did something along the same lines, although perhaps less random. Twenty-one donors were enrolled. Apart from age, sex, and ethnic background (self-described), no information was kept about them. The volunteers had to provide 130 mL of blood (a little over a quarter of a pint, enough to make for quite a vivid crime scene). Males also provided five specimens of semen, collected over a six-week period (the section of Celera’s Science paper that describes their methods is strangely interesting). Of the 21 volunteers, just five people were chosen for sequencing. Two men and three women; one African American, one Chinese, one Hispanic Mexican, and two Caucasians.

It turned out later that one of the Caucasians, a male, contributed disproportionately to the effort, and was no longer anonymous: it was Venter himself. Only a few years later, Venter had the remainder of his genome sequenced, possibly the first individual human being to have this done. ‘Possibly’ because, at around the same time, James Watson had his genome sequenced, and it’s not clear which was finished first.

That was in 2007. At the time, sequencing an individual human’s genome was an astonishing idea. Now, it’s almost commonplace — you can have your own genome sequenced, if you have a few thousand dollars to spare and the inclination to do it. Hundreds of thousands of people have had their genome sequenced already.

A few thousand dollars? It cost about three billion US dollars for the Human Genome Project to produce the first-draft human genome sequence. It has been estimated that, in 2001, it would have been possible to sequence an individual human genome for about US$100,000,000. The cost has come down at a remarkable rate as the technology has advanced. Now, it’s possible to sequence a human genome for (notionally) less than $1,000 … and the cost is still falling. Analysing the data is already more of a challenge than generating the sequence. To put the fall in costs into context, imagine that sequencing a genome was a brand-new Lamborghini, retailing at $428,000. If Lamborghinis were to come down in price to the same degree that sequencing a genome has done, you’d now be able to pick up your shiny new car at the bargain basement price of $4.30.
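The arithmetic behind that comparison, as a quick Python check. The figures are the ones quoted above; the variable names are just labels chosen for this sketch.

```python
cost_2001 = 100_000_000  # estimated cost of one genome in 2001, in US dollars
cost_now = 1_000         # notional cost of one genome now, in US dollars
lamborghini = 428_000    # sticker price used in the analogy

fold_drop = cost_2001 / cost_now
print(f"{fold_drop:,.0f}-fold cheaper")                 # 100,000-fold cheaper
print(f"Lamborghini: ${lamborghini / fold_drop:.2f}")   # Lamborghini: $4.28
```

Round $4.28 to the nearest ten cents and you have your $4.30 supercar.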

Got a few bucks? Let’s take this baby for a spin!