18
We have seen how by comparing overall similarities in DNA sequence we can reveal genetic relationships between different breeds that make sense. This can be taken to an even higher level by taking advantage of how chromosomes are organised. To see how, we return to the dance of the chromosomes we first encountered at the end of Chapter 16.
As the germ cells, the sperm and eggs, are formed, something very important occurs. After the music stops, the chromosomes move apart to continue their lives in isolation, waiting for the next time they can take to the floor. For the great majority of germ cells that time never comes. Only the chromosomes in the cells that go on to be fertilised will ever dance again, and for that they must wait until the next generation. The rest are discarded.
Under the microscope the chromosomes look the same after the dance as before. But, despite appearances, they certainly are not. During the brief encounter the chromosomes have broken apart, exchanged segments of DNA with their dancing partner and re-formed in new combinations. Each chromosome pair breaks and re-forms at a slightly different place so that, after the embrace, there are an almost infinite number of chromosome combinations in the germ cells, each one a different mosaic of paternally and maternally derived segments.
The whole point of this shuffling, and indeed the whole point of sexual reproduction itself, is to create genetic variability in the offspring. It’s this shuffling that makes sure that a child, or a puppy, never has exactly the same genetic make-up as his or her parents. There is a very good reason for going to all this trouble. We are, and have been throughout evolution, the target of pathogens. Bacteria, viruses, fungi and parasites of all sorts are constantly looking for an opportunity to invade our bodies. Were it not for our defences, we would soon be overwhelmed. Our immune system is the main defensive bulwark against pathogen invasion and it works silently and tirelessly to eliminate these threats from our system. Only when it is broken, by AIDS or immuno-suppressive drugs, do we appreciate how well our immune system has defended us, and how vulnerable we are to infection without it.
Pathogens themselves counter this solid defence by changing their genetic make-up. This is the reason for antibiotic resistance, for example, where a single bacterium mutates and is no longer killed by the drug. It multiplies and prospers among the dying legions of its contemporaries, and the infection takes hold again but this time it cannot be stopped by the antibiotic. Our defence against the pathogens’ counter-attack is to change our own genetic make-up at every generation, and we do this by gene shuffling.
Imagine splitting a pack of cards with just two suits of different colours, say hearts and spades, into separate piles with hearts in one, spades in the other. The order of the cards is the same in both, say ace to king in the usual way. These two decks represent a single pair of chromosomes before recombination. The order of genes is the same on both, but there are slight differences in their sequences. Now we cut both decks at the same place and swap them over. The numerical order of the recombined decks remains the same: Ace, two, three etc., to King, but at the cut, the colour changes from black to red in one pack and from red to black in the other. This represents the shuffling of one pair of chromosomes in one generation. There are already a lot of different possible combinations in the final decks and this number increases as we add in the other chromosomes. Even with this crude analogy it is easy to appreciate that the total number of possible combinations is vast. The shuffled chromosomes separate into different germ cells. Thinking of sperm, most will perish, but one will survive to fertilise an egg where it meets up with the shuffled deck from the mother. Repeated over the whole set of chromosomes, the possible combinations of genes are mind-bogglingly huge. Almost certainly that exact combination will not have been seen before in the whole history of a species. Importantly, and this is the point, pathogens will not have seen the same genetic combination before and will not have time to evolve a way of overcoming it.
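For readers who like to see an idea in code, here is a minimal Python sketch of the card analogy: two thirteen-card piles, one cut, one swap. The function names and the choice of cut point are mine, purely for illustration, and real recombination is a good deal messier than a single clean cut.

```python
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

def make_deck(suit):
    """One 'chromosome': thirteen cards in rank order, all of one suit."""
    return [(rank, suit) for rank in RANKS]

def crossover(deck_a, deck_b, cut):
    """Cut both decks after `cut` cards and swap the lower portions."""
    if not 0 < cut < len(deck_a):
        raise ValueError("cut must fall strictly inside the deck")
    return deck_a[:cut] + deck_b[cut:], deck_b[:cut] + deck_a[cut:]

hearts = make_deck("hearts")
spades = make_deck("spades")
new_a, new_b = crossover(hearts, spades, cut=5)

# The rank order is unchanged in both recombined decks...
print([rank for rank, _ in new_a] == RANKS)
# ...but the suit switches at the cut.
print([suit for _, suit in new_a])
```

With a single cut there are only twelve places to cut a thirteen-card deck, but a chromosome offers vastly more break points than that, and each chromosome pair is cut independently.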
Some plants, and a few animals, have done away with sexual reproduction altogether. The strategy can be very successful because they no longer need two sexes. Males are redundant. The glory is, however, short-lived. Sooner or later a pathogen will find the key to the defence system and, once it has succeeded in a single individual, will sweep through the entire population with devastating effect. As an example, South American bananas are propagated by cuttings and not sex, so that all banana plants are genetically identical clones. A fungus, Fusarium oxysporum, managed to unlock the defences in one plant, and is now sweeping through the entire crop.
I have taken rather a long time to explain the evolutionary reasons behind gene shuffling, but there is a very practical effect. Gene shuffling, or recombination to give its proper title, means that we can map and locate genes. In dogs, these might determine size, or coat colour, or tail shape, or perhaps a genetic disease like hip dysplasia. This is how it works.
Taking the risk of overworking the playing card metaphor, let’s suppose we want to find the gene responsible for one of these traits, as they are called. As a relevant dog example, we will take the form of hip weakness that affects several related pedigree breeds, like Labradors. We may think it’s caused by a faulty gene, but we have no idea which one. Sequencing the entire dog genome will indeed give us the sequence of DNA of the dysplasia gene, along with all 19,000 others. From this alone, there is no way of telling which of these genes is associated with the hip problem. You can make a stab at it by examining the sequence of genes that might be involved, like cartilage collagen genes, say, but it’s just a guess and may well be wrong. It usually is.
This is where recombination comes to our aid. Let’s say that, without looking, we insert a joker, representing the dysplasia gene, somewhere in the deck. I’ve already mentioned one benefit of the dog genome project, the discovery and location of tens of thousands of points (we called them SNPs) where the genome varies between dogs. They are known as genetic markers, and that is just what they are, molecular flags at intervals along the genome.
The likelihood, in a pedigree dog, is that the hip dysplasia mutation occurred just once and has spread to some of its dog descendants in succeeding generations. Although the initial mutation may well have happened many generations back, the joker (the card causing the disease) will still be flanked by the same genetic markers. Although we can’t recognise the joker, we can look to see which markers have travelled with it through the generations. If all or most of the dogs with hip dysplasia also have the same marker (the five of clubs, let’s say), then we can assume that the joker (the dysplasia gene) is nearby. We can then look carefully at those genes lying close to the five of clubs in our pack and, by sequencing them, locate not only the dysplasia gene but, if we are lucky, also the actual mutation in that gene which causes the disease. It’s a very powerful method and has been used, with minor variations, in identifying the genes for several inherited human diseases. In many cases the mutations were in hitherto unknown genes, so ‘guessing’ was never going to find them.
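The logic of looking for the marker that travels with the joker can be sketched in a few lines of Python. Everything here is hypothetical: the six dogs, the marker names `m1` to `m3` and their versions are invented, and the score is a crude difference in frequencies rather than the formal statistical tests a real mapping study would use.

```python
from collections import Counter

# Hypothetical toy data: whether each dog is affected, and which version
# of each (made-up) marker it carries.
dogs = [
    {"affected": True,  "markers": {"m1": "A", "m2": "G", "m3": "T"}},
    {"affected": True,  "markers": {"m1": "C", "m2": "G", "m3": "T"}},
    {"affected": True,  "markers": {"m1": "A", "m2": "G", "m3": "C"}},
    {"affected": False, "markers": {"m1": "A", "m2": "T", "m3": "T"}},
    {"affected": False, "markers": {"m1": "C", "m2": "T", "m3": "C"}},
    {"affected": False, "markers": {"m1": "C", "m2": "G", "m3": "T"}},
]

def association_scores(dogs):
    """Frequency of each marker version in affected dogs minus its
    frequency in unaffected dogs (assumes both groups are non-empty)."""
    affected = [d for d in dogs if d["affected"]]
    unaffected = [d for d in dogs if not d["affected"]]
    in_affected = Counter(
        (m, v) for d in affected for m, v in d["markers"].items()
    )
    in_unaffected = Counter(
        (m, v) for d in unaffected for m, v in d["markers"].items()
    )
    return {
        key: in_affected[key] / len(affected)
        - in_unaffected[key] / len(unaffected)
        for key in set(in_affected) | set(in_unaffected)
    }

scores = association_scores(dogs)
best = max(scores, key=scores.get)
print(best)  # ('m2', 'G')
```

In this invented example every affected dog carries the `G` version of `m2`, while only one unaffected dog does, so that is the neighbourhood where we would start sequencing.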
There is also an evolutionary history written in recombination. Going back to our metaphorical card decks for a moment, the cutting and shuffling that goes on in every generation will continue to mix the order of the cards. After just one generation, the recombined deck contains one long run of red cards followed by a long run of black. In successive generations the runs of colour will get shorter and shorter as the genomes mix.
This gradual mixing over time can be followed with the same genetic markers that were used to locate the dysplasia gene. It is a useful way of exploring how long a breed has been evolving. The longer that it has been closed to outsiders, as current registration rules insist, the more chromosome shuffling will have taken place. The runs of unmixed cards will get shorter and shorter over time. This means that a recently evolved breed will have longer stretches of chromosomes uninterrupted by recombination than will a long-established breed.
These segments are known as ‘haplotype blocks’ and their length can be measured. This approach can dissect the genetic structure within breeds even more closely than can the measure of overall genetic similarity used by Bridgett vonHoldt and Robert Wayne in 2010. Seven years later, in 2017, Heidi Parker published her study of haplotype blocks.1 Unsurprisingly, the overall relationships between breeds closely resembled those drawn earlier with unsorted genomic data, but the new study added some twists. For the first time, we could see where two breeds shared haplotype blocks from a common ancestor even though they were not particularly closely related overall. This was delving into the genetic history of breeds not with a magnifying glass but with a microscope.
Closer inspection of the results adds a wealth of detail both about the history of domestication and breed formation in general and the peculiarities of particular breeds. Whereas the earlier unsorted genomic trees gave us a good idea of the overall genetic differences between breeds, the haplotype block approach adds a time element.
As generations pass, recombination slowly breaks up the linked haplotype blocks as the chromosomes are broken and re-formed. The positions of these break points are to all intents and purposes random, so that the haplotype blocks become gradually shorter and shorter over time. This means that by estimating the average length of shared haplotype blocks between two breeds we get a good idea of how long in the past they have been evolving separately.
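A toy model makes the link between block length and time concrete. The Python sketch below assumes, purely for illustration, that each generation adds exactly one break at a random position on a chromosome of length one, and that every break is kept. Real chromosomes and real breeding histories are far less tidy, so treat the numbers as a cartoon rather than an estimate.

```python
import random

def block_lengths(generations, chromosome_length=1.0, seed=0):
    """Lengths of unbroken blocks after one random break per generation."""
    rng = random.Random(seed)
    breaks = sorted(
        rng.uniform(0, chromosome_length) for _ in range(generations)
    )
    edges = [0.0] + breaks + [chromosome_length]
    return [right - left for left, right in zip(edges, edges[1:])]

def mean_block_length(generations, chromosome_length=1.0, seed=0):
    lengths = block_lengths(generations, chromosome_length, seed)
    return sum(lengths) / len(lengths)

def generations_from_mean_length(mean_length, chromosome_length=1.0):
    """Invert the toy model: mean length = chromosome_length / (generations + 1)."""
    return chromosome_length / mean_length - 1

for g in (1, 10, 100):
    print(g, round(mean_block_length(g), 4))
```

Under these assumptions the average block after g generations is simply the chromosome length divided by g + 1, which is why a measured average length can be turned back into a rough number of generations.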
The African Basenji shares the shortest segments of any with other dog breeds, confirming its reputation as a very unusual dog with a long independent history. Long used for hunting both by sight and scent in many parts of Africa, the Basenji has some distinctly odd habits for a dog. For a start, it does not bark. Instead it chortles and growls and occasionally starts to yodel. Unlike other dogs but in common with wolves, it breeds only once a year. It is very fastidious in its grooming, just like a cat, and also like a cat it is happy to spend all day just staring out of the window. Because of these unusual features and its superficial resemblance to images from ancient Egypt, many people have thought of the Basenji as a very ancient breed and perhaps the one most closely resembling the first domesticated dogs.
The haplotype-sharing genetic analysis certainly marks the Basenji out as a breed with an unusual evolutionary history, more or less separate from all other modern dog breeds. However, this does not necessarily mean, as some think, that the Basenji resembles the first domesticated dogs more closely than any other breed. Rather, it implies that its ancestry has been separate and it has been evolving on its own for a long time.
An interesting illustration of the effect of sorting breeds according to their shared haplotype blocks is found in the Eurasier. This is a breed of whose history we know a great deal. The breed was developed in Germany by Julius Wipfel and Charlotte Baldamus in 1960 for the express purpose of creating the perfect companion dog. Wipfel and Baldamus wanted a dog that showed some original wolfish features, not some pampered breed which had all the wildness bred out of them. They settled on crossing a Chow with a Wolfspitz. The crossing was successful and the Eurasier has since become a favourite companion dog with a devoted following, which included the Austrian animal behaviourist Konrad Lorenz. Lorenz believed that his Eurasier, Babett, had the best character of all the dogs he had known.
The two parent breeds came from entirely different backgrounds, the Chow from East Asia and the Wolfspitz from Germany. Heidi Parker and her team included a Eurasier in their project aimed at assessing the amount of hybridisation between different breeds. She did this by breaking down the dog genome into convenient haplotype blocks each containing a hundred SNP markers spread over at least 232 kilobases (kb). Then she measured the number of these blocks that were shared by pairs of dogs from different breeds. The rationale was that the higher the degree of sharing between the pairs of breeds, the more hybridisation had played a part in the breed’s formation. As one might expect, the degree of sharing between breeds of the same functional type (hunting, working etc.) was much higher, about four times, than with breeds from different clusters in the original vonHoldt and Wayne phylogram. However, when it came to the Eurasier, Parker’s analysis found that the breed shared haplotypes with both ‘parent’ breeds, the Keeshond, closely related to the Wolfspitz, and the Chow. Because the origins of the two parent breeds were geographically so different, they had few haplotype blocks in common and the Eurasier formed a new clade (a group evolved from one common ancestor) on its own, lying between the two parent breeds on the resulting phylogram.
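The counting step can be sketched as follows. This is not Parker's actual pipeline, only a simplified Python illustration: it treats each dog as a single string of SNP versions, ignores the 232 kb length requirement and the fact that dogs carry two copies of each chromosome, and uses blocks of four markers rather than a hundred so the example fits on the page. The function names and the two toy dogs are invented.

```python
def split_into_blocks(snps, block_size=100):
    """Chop a sequence of SNP versions into consecutive fixed-size blocks,
    dropping any incomplete block at the end."""
    usable = len(snps) - len(snps) % block_size
    return [
        tuple(snps[i:i + block_size]) for i in range(0, usable, block_size)
    ]

def shared_block_count(snps_a, snps_b, block_size=100):
    """Number of blocks, position by position, that are identical in both dogs."""
    blocks_a = split_into_blocks(snps_a, block_size)
    blocks_b = split_into_blocks(snps_b, block_size)
    return sum(a == b for a, b in zip(blocks_a, blocks_b))

# Two hypothetical dogs typed at twelve markers; they differ at one marker
# in the middle block only.
dog_a = "AAGTCCGTTTAC"
dog_b = "AAGTCAGTTTAC"
print(shared_block_count(dog_a, dog_b, block_size=4))  # 2
```

Repeating this count for every pair of dogs from different breeds gives the kind of sharing table from which hybridisation between breeds can be inferred.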
Most ‘regular’ breeds, by which I mean those formed from dogs of similar background, were confined to a single clade with strong statistical support. However, there were some breeds that did not fit neatly into well-defined clades, including the Belgian sheepdog, known as the Tervuren, the Cane Corso, sometimes called the Italian Mastiff, the unmistakable Bull Terrier and its miniature version, the feisty American Rat Terrier, and the American Hairless Terrier, a modern breed created from a bald Darwinian ‘sport’ picked from a Rat Terrier litter. These breeds lay only just outside the single clade rule by this test, but four other breeds stood out from the crowd. They were the American Redbone Coonhound, the Jack Russell terrier, the Sloughi, a sight-hound from North Africa, and the Cane Paratore, an ancient herding dog from the mountainous Abruzzi region of Italy. These misfits hint at a murky past, perhaps being not quite as pure as aficionados of these breeds might prefer. Other breeds that were adrift were either new breeds ‘under development’ in the sense that they had not been fully defined, or the same breed collected in different countries. The Cane Corso is a good example of the latter. In Italy, the birthplace of the breed, the Cane Corso forms a discrete clade, whereas in the USA the breed forms a mixed clade related to the, very similar, Neapolitan mastiffs. This all tells us that the genetic structure of these two closely related dog breeds has already begun to differentiate.
Parker and her colleagues took the next step, which was to examine the similarity of breeds, not just from the point of view of measuring the extent of gene-sharing, but looking at the surrounding areas on each chromosome. To explain this I will need to wind back for a moment. I have blithely referred to ‘gene-sharing’ without being clear what I meant by it. All dogs have the same basic set of genes which carry the information about how to build and run a dog from one generation to the next. Genes are, strictly speaking, bits of DNA that do something rather than just being there. Some dog genes control the structure of the eyes, others dictate the form of the bones or the length and colour of the coat. All dogs have these, so in that respect they share 100 per cent of their genes. That makes for a very dull comparison between breeds – they are all the same.
What I am really talking about is the sharing of different versions of the same gene. For instance, a black dog will have a ‘black’ version of a coat colour gene and a white dog will have a ‘white’ version of the same gene. The precise DNA sequence of the two versions will be different, and this difference can be detected in a way I will explain a little later.
I hesitate to introduce a new term because I know from teaching genetics for many years that it is a hard one to grasp. But here goes. The different versions of the same gene are called alleles, short for allelomorphs or ‘other forms’. When I refer to ‘gene-sharing’ between breeds, what I really mean is ‘allele-sharing’. For the rest of the book I will carry on using the misnomer ‘gene-sharing’ because this is most definitely not a genetics textbook, there is no exam at the end of term and I don’t want you to miss the point of what I am trying to get across by thinking ‘What on earth was it that he meant by allele?’ and thereby losing the thread.
This is also as good a time as any to say more about how DNA is analysed these days. As you know by now, DNA is literally a long string of four simple chemicals. Very long, actually. It is a code which gives instructions to cells to make this or that protein, like haemoglobin for blood, collagen for bone and all the thousands of other things a cell and a body needs. Like any written language, the sequence of the letters, or in this case the chemicals, is the message. There are only four, abbreviated to A, C, G and T, but even this very restricted selection makes for an infinite number of different sequences if they are long enough. And long they are. The entire dog genome, first sequenced in 2005, is roughly 2.8 billion letters (or ‘bases’ as a nod to their chemical structure) in total length. That’s about the same length as the human genome, but the DNA is spread, as we have already seen, between thirty-nine pairs of chromosomes rather than the twenty-three pairs we humans possess. As we have also seen, there are about 19,000 dog genes, about 3,000 fewer than we humans have.
The daunting task of actually sequencing DNA began in 1977 when the outstanding Cambridge molecular biologist Fred Sanger published the entire sequence of the simple virus called Phi X 174. Its genome, tiny in comparison to the dog or the human, was still 5,375 base pairs (bp) long. Sanger followed this up in 1981 with the first sequence of human mitochondrial DNA, which became the foundation for the whole field of using DNA to explore the past, something I have been doing for the past twenty-five years. I and thousands of other scientists owe Fred Sanger a great deal, and his contribution to science was recognised by the award of not one but two Nobel Prizes in Chemistry, once in 1958 for the amino-acid sequence of insulin (he bought his supply from the Cambridge branch of Boots the Chemists!) and again, in 1980, for his DNA work. I remember seeing Sanger give a lecture in Oxford around that time and thinking how very self-effacing he seemed. He hated giving lectures and, allegedly, employed a secretary specifically to decline on his behalf the myriad invitations to speak that inevitably follow in the wake of a Nobel Prize. Fortunately for me, he made a rare exception of the Oxford invitation.
Sanger’s method is still in use, and I still use it myself for mitochondrial DNA. Both the human genome and dog genome sequences were first completed using versions of the Sanger method automated on an industrial scale. Nevertheless, the former took fifteen years and cost over a billion dollars. The time and expense involved encouraged scientists to think of other, faster, cheaper ways of sequencing DNA and they came up with several ingenious methods. The eventual winner was invented by two French scientists, Bruno Canard and Simon Sarfati, then further developed by two Cambridge chemists, Shankar Balasubramanian and David Klenerman. In contrast to Sanger, and in a sign of the times, they founded a company, called Solexa, to commercialise their methods. Solexa was eventually acquired by the US tech company Illumina and it is they who now supply the sequencers and reagents for what has become known paradoxically as ‘next-generation’ sequencing.
Without going into too much detail, the Illumina method adds bases attached to fluorescent dyes to a growing DNA strand that is chemically tethered to a small glass flow-cell and copies the exact sequence of the original DNA. As the bases are added one by one they give off a burst of light which is captured by a camera. The colour of the flash depends on which base is added: blue for G, red for T, green for C and yellow for A. At the end of the reaction cycle there are about ninety bases added to each tethered strand. The camera has recorded the order of fluorescent flashes – blue, blue, red, green and so on – in the same sequence that the bases were added, which directly translates to the sequence of the original. There could be millions of strands tethered to the flow-cell, each of them from a different fragment of DNA, and some pretty sophisticated software is required to sort out the flashes and assemble the short fragments into an overall sequence that, on a good day, covers the entire genome.
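The translation from flashes to letters is the easy part, and a few lines of Python show it for a single strand. The colour scheme is the one described above; the function name is my own, and the hard work of image analysis and assembly is not attempted here.

```python
# Colour-to-base scheme as described in the text.
COLOUR_TO_BASE = {"blue": "G", "red": "T", "green": "C", "yellow": "A"}

def decode_flashes(colours):
    """Turn the recorded order of flashes for one strand into a DNA read."""
    return "".join(COLOUR_TO_BASE[colour] for colour in colours)

read = decode_flashes(["blue", "blue", "red", "green"])
print(read)  # GGTC
```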
Thus, in the forty years since Fred Sanger painstakingly sequenced the 5,000 or so DNA bases in Phi X 174, sequencing has advanced to the point where the latest Illumina machines can get through fifty complete genomes, each of three gigabases (3 × 10⁹ bases), every day.
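To put a number on that leap, here is the arithmetic as a short Python sketch, using only the figures quoted in this chapter. It counts each genome once and ignores the fact that, in practice, every stretch of DNA is read many times over.

```python
PHIX174_BASES = 5_375          # Sanger's 1977 sequence, as quoted above
GENOMES_PER_DAY = 50
BASES_PER_GENOME = 3 * 10**9   # three gigabases

bases_per_day = GENOMES_PER_DAY * BASES_PER_GENOME
phix_equivalents_per_day = bases_per_day / PHIX174_BASES

print(f"{bases_per_day:,} bases per day")
print(f"about {phix_equivalents_per_day / 1e6:.0f} million Phi X 174 genomes a day")
```

That is 150 billion bases a day, or roughly 28 million times the length of the virus genome that launched the field.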