17. Why LEGO is Better Than Airfix
Most children, and quite a few adults, enjoy making models. There are various ways of doing this but let’s just look at the extremes. One of the most popular formats in the UK for over 30 years was the Airfix kit. Small plastic parts specific to an aircraft, ship, tank or just about anything else you can think of (Bengal Lancer, anyone?) were supplied with detailed instructions. The user glued the parts together, painted them, applied transfers and admired the finished article for years after.
At the other extreme is that universal Danish toy of which I am so fond, LEGO. Although there are lots of specialist LEGO kits now, the concept remains the same as ever. A relatively limited number of components that can be joined together in any combination the user wants. And the model can always be split back down into its original bricks and reused to create something else.
Simple organisms like bacteria tend more to the Airfix way of life. Their genes are fairly set, coding for just one protein. The more complex an organism becomes, the more the genome begins to resemble LEGO, with a much greater degree of flexibility in how the components are used. And when we think how extraordinary we humans are, it seems reasonable to say, in a nod to a certain movie, that at the genomic level, ‘everything is awesome’.*
An extreme version of this phenomenon is the splicing that our cells can use to create multiple related proteins from one gene, as in Figure 2.5 (page 18). This ability to use the components of a gene in multiple different ways creates enormous flexibility and added opportunities for an organism. We can get some idea of the amount of variability that is possible by looking at some of the numbers involved. Human genes contain an average of eight amino acid-coding regions, each separated by an intervening stretch of junk DNA.* At least 70 per cent of human genes have been shown to create at least two proteins.1 This is achieved by joining up different amino acid-coding stretches. Using our DEPARTING example (as in Figure 2.5), this allows us to produce the protein DART but also the protein TIN. The ability to create different proteins this way is known as alternative splicing.
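For readers who like to see an idea in code, the DEPARTING example can be sketched in a few lines of Python. This is only a toy model. It assumes, purely for illustration, that each letter stands for one amino acid-coding region and that splicing can drop regions but never reorder them. The function name is hypothetical.

```python
def can_be_spliced_from(product: str, gene: str) -> bool:
    """Return True if `product` can be made by dropping letters from `gene`
    while keeping the remaining letters in their original order."""
    remaining = iter(gene)
    # `letter in remaining` advances the iterator until it finds the letter,
    # so each match has to come after the previous one.
    return all(letter in remaining for letter in product)


gene = "DEPARTING"
for word in ("DART", "TIN", "TRAP"):
    print(word, can_be_spliced_from(word, gene))
```

DART and TIN pass the test because their letters appear in DEPARTING in the right order. TRAP fails, because in this toy model the pieces cannot be shuffled.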
The regions that code for amino acids are short compared with the intervening junk regions. The stretches that code for amino acids have an average length of about 140 base pairs, but they can be surrounded by junk regions that are several thousand base pairs in length.2 About 90 per cent of the base pairs in a gene are from the intervening sequences, not from the amino acid-coding ones. If we think of this in terms of the English language, this immediately shows us some of the problems the cell faces.
Imagine you meet someone and are extraordinarily smitten by them. You have heard that they love poetry, and you want to sweep them off their feet, but you always skipped literature class in school. A friend gives you a piece of paper with a killer first line from a poem on it. But for some reason your vaguely sociopathic friend has split up the words of the first line among a load of gibberish, and you only have a couple of seconds to find the poetry, say it out loud and win someone’s heart (or at least their attention). Can you do it? Take a very quick look at Figure 17.1 and find out.
This is what our cells do all the time, every moment of every day of our lives. Machinery in the cell analyses a long stretch of apparent gibberish and almost instantaneously finds the hidden words and joins them together. You can take a look at Figure 17.2 to see if you managed to compete with the non-sentient proteins that keep you alive.
image
Figure 17.1 Have a quick look. Can you find the line that will win someone’s heart?
In any long stretch of random letters there will also be combinations that spell words just by chance. Use these words by mistake when wooing (does anyone still woo?) the object of your desires and you may ruin your one chance for happiness. Figure 17.3 will show you how.
image
Figure 17.2 The words shown in bold and underlined should do the trick for you.
image
Figure 17.3 No! Bad combination!
By using this slightly bizarre example, we can understand some of the mechanistic challenges that our cells face when splicing RNA molecules properly. If we were designing this as a process, it would have the components shown in Figure 17.4.3 In addition to the components described in this diagram, it’s important to realise that different cells will handle the same gene differently, depending on the cell type and what is happening to it at any given moment. Consequently, all the stages have to be appropriately regulated and integrated so that the correct protein variants are made to meet the needs of the situation.
image
Figure 17.4 The sequence, reading from the top, lays out the steps that the splicing machinery has to be able to carry out to join up the appropriate amino acid-coding regions to create the correct mature messenger RNA.
The splice of life
This splicing of long RNAs to create smaller messenger RNAs that carry the information for specific proteins is a really complex process. It’s a very ancient system, and the components and steps have been maintained from yeast throughout the entire animal kingdom. It is carried out by a huge conglomeration of molecules called the spliceosome, which forms the splicing machinery. The spliceosome is composed of hundreds of proteins and also some junk RNAs, a little like the ribosomes that act as the factories to produce proteins.4
One of the critical stages is that the spliceosome wraps around the intervening sequences that need to be removed from an RNA molecule. It snips them out and then joins up the amino acid-coding regions. It’s an enormously complicated multi-stage process but we know that one of the first key steps is that the spliceosome needs to recognise the intervening regions, so that it can bind to them and remove them.
The beginnings and ends of these intervening sequences are always indicated by particular two-base sequences. Junk RNA molecules in the spliceosome can bind to these two-base sequences in much the same way as the two strands of DNA can pair up in our genes.
But there are only four bases in RNA, which means there are only sixteen two-base sequences (AC and CA are considered as different pairs, as are all the others). We would expect that the two-base sequences that mark the beginnings and ends of the intervening sequences would also be found elsewhere in these sequences, and also in the amino acid-coding regions. This is indeed the case. So although these two-base sequences are necessary for splicing, they aren’t sufficient on their own to direct the process properly. Other sequences are also required, as indicated in Figure 17.5.
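The counting argument is easy to check. The sketch below lists every two-base sequence and then searches a short stretch of RNA for GU and AG, the kinds of two-base sequences mentioned in this chapter. The RNA fragment and the helper name `find_motif` are invented for illustration and do not come from any real gene.

```python
from itertools import product

BASES = "ACGU"

# Every ordered pair of bases: AC and CA count as different sequences.
two_base_sequences = ["".join(pair) for pair in product(BASES, repeat=2)]
print(len(two_base_sequences))  # 16


def find_motif(rna: str, motif: str) -> list[int]:
    """Return every position where `motif` starts in `rna` (overlaps allowed)."""
    return [i for i in range(len(rna) - len(motif) + 1) if rna.startswith(motif, i)]


# Hypothetical RNA fragment, made up for this example only.
rna = "AUGGUCAGGUAAGUCCAGGCU"
print(find_motif(rna, "GU"))
print(find_motif(rna, "AG"))
```

Even in this 21-base fragment each motif turns up several times, which is why a two-base signal alone cannot tell the spliceosome where to cut.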
The other sequences involved in selecting how splicing will take place are found in both the junk intervening regions and the amino acid-coding regions. Some of them influence splicing very strongly, others are more subtle. Some increase the chances of a splice event, others decrease them. They work in complex partnerships and the impact that they have on the final splicing pattern is affected by other things happening in the cell, such as the precise complement of proteins in the spliceosome. The descriptions that are used for these modifying sequences usually include such words as ‘dizzying’ or ‘bewildering’. These are geek speak for ‘unbelievably complicated, way beyond anything we can get our heads around or even design predictive computer algorithms for at the moment.’
image
Figure 17.5 Multiple sequences within an RNA molecule interact to drive splicing. The two-base motifs shown are necessary but not in themselves sufficient to regulate all the fine-tuning of this process. Other sites are involved, of varying strengths, as indicated by the different sizes of arrows.
Splicing and disease
We can get clues to the degree of sophistication by looking at a group of genetic diseases. These include a form of blindness called retinitis pigmentosa, which affects about one in 4,000 people. The blindness is progressive, often starting in the teenage years with a decline in night vision, and then becoming steadily worse and more disabling with age. The loss of vision occurs because the cells in the eye that detect light gradually die off.5 About one in twenty cases is caused by a mutation in a gene coding for one of five proteins involved in a specific step in splicing.6,7,8,9 The mutation only causes a deficit in the cells of the retina, and not in all the other cells in the body which also rely on splicing. This shows us that splicing is under complex cell- and gene-specific control, in ways that we haven’t yet been able to understand.
By contrast, there is a very severe form of dwarfism with other unusual features such as dry skin, sparse hair, seizures and learning disabilities. Affected children almost always die before they are four years old.10 It’s very rare except in the Ohio Amish community, where 8 per cent of the people are carriers. That’s because the mutation that causes this condition was present in the small number of families that founded this community. It isn’t found in other Amish groups such as those in Pennsylvania, which were founded by other families. When the mutation that causes this condition was identified, the researchers first thought that it was changing the amino acid sequence of a gene that codes for a splicing protein. But we now know that the change actually disrupts the three-dimensional structure of a junk RNA that forms part of the spliceosome.11 Unlike the retinitis pigmentosa situation, this defect in the action of the spliceosome causes a very wide-ranging set of symptoms, possibly by causing mis-splicing of lots of different genes.
Human disorders don’t just occur because of defects in the splicing machinery. They can also arise because protein-coding genes themselves have mutations in sites that are important for the control of splicing of the RNA from that single gene. Some authors have claimed that up to 10 per cent of human inherited disorders may be caused by mutations at the splice sites, those two-base sequences shown in Figure 17.5.12
One example of this mechanism was a family in which two young siblings developed intractable diarrhoea within a few days of birth. Medical staff managed to stabilise the children, but the diarrhoea persisted for many months and one of the two affected children died at seventeen months of age. When the genomes of the children were sequenced, the researchers found a mutation in a splice site in a gene, changing one of the GU sequences shown in Figure 17.5. This resulted in the splicing machinery skipping over an amino acid-coding region inappropriately. Essentially, an amino acid-coding region was left out of the protein, and as a consequence the protein could no longer do its job.13
Kaposi’s sarcoma is a cancer that first came to public attention when it was found at high levels in people with AIDS. AIDS is caused by the human immunodeficiency virus (HIV) and the effect of the HIV infection is to suppress the immune system. Kaposi’s sarcoma is caused by a different virus called HHV-8. Normally our immune systems control this virus but if the immune system is seriously below par, HHV-8 can become established and trigger Kaposi’s sarcoma.
HHV-8 is present in a high percentage of people in the Mediterranean basin, but Kaposi’s sarcoma is rare in this population, and almost never found in small children. So medics were very surprised when a Turkish family brought in their two-year-old daughter who had a classic lesion characteristic of this cancer on her lip. The cancer spread rapidly and aggressively and the little girl died just four months after she was first diagnosed.
The child was negative in all tests for HIV. Her parents were related to each other, a first-cousin marriage. Researchers looked for genetic reasons why the daughter might have an impaired immune response to HHV-8.
By sequencing DNA obtained from samples that had been taken from the deceased girl, scientists identified a mutation in a splice site of a specific gene. The mutation changed an AG to an AA, which meant the spliceosome could no longer recognise where it was meant to cut the RNA molecule. The result was that a junk region that should have been removed was retained in the messenger RNA molecule. This messed up the sequence, creating a stop signal much too early in the messenger RNA. This prevented the ribosome from making the full-length protein. Because the protein is one that is required for mounting a good immune response to viruses such as HHV-8, the child with the mutation was very susceptible to Kaposi’s sarcoma.14
Although splice site mutations are relatively common, genetic diseases are more often caused by mutations in the amino acid-coding regions of genes. Some of these cause problems because they introduce stop signals that prevent the ribosomes from making full-length proteins from messenger RNA templates. Other mutations may change the code from one amino acid to another. For example, CAC codes for the amino acid histidine whereas CAG codes for glutamine, a different amino acid. But researchers have speculated that up to 25 per cent of the mutations that change the amino acid in this way also influence the splicing of nearby regions in the messenger RNA. In some cases the disease may be due not to the single amino acid alteration per se, but to the variation that the nucleotide change creates in the way a messenger RNA is spliced.
The problem is that it is very difficult to demonstrate that this is the case in most situations. Even if we can show that the change in the RNA leads to both an altered splicing pattern and an amino acid change, how can we tell which effect causes the disease symptoms? Are these due to a protein with one altered amino acid, or because the messenger RNA has also been spliced in an unusual pattern?
Nature has actually provided us with proof that sometimes a mutation in a coding region can cause a disease by influencing splicing, rather than by changing an amino acid. There is an extraordinary disorder called Hutchinson-Gilford Progeria, named after the two scientists who first identified it. Progeria means early ageing and this particular form is incredibly dramatic. It is also extremely rare, affecting about one in 4 million children.15
Affected babies seem perfectly healthy at first but within a year their growth rate slows dramatically, and they remain underweight and short for the rest of their lives. The children begin developing many symptoms of old age, including thinning hair, stiffness and baldness. Although there are some ageing conditions that they don’t develop, such as Alzheimer’s disease (and the children also don’t have learning disabilities), the affected individuals do develop severe cardiovascular disease. This is usually what causes death by the early teens, as a consequence of heart attacks or major strokes.
In 2003 researchers identified the gene mutation that causes Hutchinson-Gilford Progeria. Every patient they tested had a de novo mutation, meaning one that developed spontaneously in the parents’ egg or sperm. Incredibly, in eighteen unrelated patients (out of twenty who were assessed) the mutation was exactly the same.16
A sequence that should read GGC in a particular gene had mutated and now read as GGT. This mutation was in the amino acid-coding part of the gene. This might seem like a straightforward case of a mutation changing an amino acid in a protein, so of course the first thing to do is to look at the genetic code and see what these two sequences code for. GGC, the normal sequence, codes for a simple amino acid called glycine. But the mutated sequence, GGT, codes for – wait for it – glycine. Yep, same amino acid.
This is because our genetic code has a level of redundancy. Our genome is composed of four letters – A, C, G and T (or U in RNA). Blocks of three letters are used to code for an amino acid. There are 64 possible combinations of three from four letters. Three of these combinations are stop signals, telling the ribosomes not to add any more amino acids to a protein chain. This leaves 61 combinations to code for amino acids. But our proteins only contain 20 different amino acids. So some amino acids can be coded for by different three-letter combinations. At one extreme, glycine is coded for by GGA, GGC, GGG and GGT(U). At the other, the amino acid methionine is only coded for by the combination AT(U)G.
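The numbers in this paragraph can be checked against the standard genetic code. The sketch below builds the codon table from a compact 64-letter string (one letter per codon, with the bases taken in the order T, C, A, G, and `*` marking a stop). The variable and function names are just for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid: str) -> list[str]:
    """All three-letter combinations that code for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


stops = codons_for("*")
print(len(CODON_TABLE), len(stops), len(CODON_TABLE) - len(stops))  # 64 3 61
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20
print(codons_for("G"))  # glycine: ['GGA', 'GGC', 'GGG', 'GGT']
print(codons_for("M"))  # methionine: ['ATG']
print(CODON_TABLE["GGC"] == CODON_TABLE["GGT"])  # True
```

The last line is the point of the progeria story: GGC and GGT look different to the spliceosome but identical to the ribosome.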
But if the amino acid sequence encoded by the mutated gene doesn’t change in Hutchinson-Gilford Progeria, what causes the dramatic phenotype in this condition? Look again at Figure 17.5. The two-base sequence at the beginning of each intervening junk region within a gene is GT. In the patients where the normal GGC changes to GGT, the amino acid region gains an inappropriate extra splice signal. In the context of all the other splicing signals in that genomic region, this inappropriately positioned GT acts strongly. The spliceosome cuts the messenger RNA in the amino acid-coding region rather than in the junk region. The amino acid-coding regions join up badly and the end result is a loss of about 50 amino acids from the end of the protein. This in turn means that the protein itself isn’t processed properly, and it begins to wreak havoc in the cells. We still don’t know exactly how this leads to the extraordinary ageing we see in these children, but our best guess at the moment is that the cell nucleus isn’t maintained properly. This may lead to changes in gene expression and nuclear breakdown. Some genes and some cell types may be more sensitive to this than others.
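A small sketch shows how a silent change can still create a new GT. Only the GGC and GGT codons come from the text. The lower-case flanking letters are hypothetical padding, and the function name is made up for this illustration.

```python
def new_motif_positions(normal: str, mutant: str, motif: str = "GT") -> list[int]:
    """Positions where `motif` appears in `mutant` but not at the same place in `normal`."""
    return [
        i
        for i in range(len(mutant) - len(motif) + 1)
        if mutant.startswith(motif, i) and not normal.startswith(motif, i)
    ]


# Hypothetical flanking bases (lower case) around the codon named in the text.
normal = "caGGCac"
mutant = "caGGTac"
print(new_motif_positions(normal.upper(), mutant.upper()))
```

The sketch only finds the new GT. Whether the spliceosome actually uses it depends on all the other signals nearby, which this toy example makes no attempt to model.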
There is another condition that affects young children called Spinal Muscular Atrophy. In this condition, the nerve cells supplying the muscles gradually die off, leading to muscle wasting and loss of mobility. There are a number of different forms, and in the most severe version the life expectancy for affected babies is very low, less than eighteen months.17 It is relatively common for a genetic disease: in the UK about one in 40 people is a carrier, meaning about 1.5 million of us have one defective copy of the gene. Luckily, both copies of the gene have to be mutated for symptoms to develop.18
Spinal Muscular Atrophy is caused by the deletion or loss of function of a gene called SMN1. If we look at the human genome we might be surprised that this has such a major effect, because there is another gene that codes for exactly the same protein. This gene is called SMN2. This raises a really obvious question: since they code for the same protein, why can’t the SMN2 gene compensate for a damaged/deleted SMN1 gene?
In a rather similar way to Hutchinson-Gilford Progeria, the SMN2 gene has one subtle variation from SMN1. This is a change in DNA sequence in an amino acid-coding region. It doesn’t change the amino acid sequence, because it is in one of the three-base sequences with redundancy in terms of amino acid coding. Instead it changes one of the sites that help the spliceosome to work out where to splice the messenger RNA molecule.19 It doesn’t change a splice site, it changes one of the sites that influences where splicing occurs. This results in skipping of an amino acid-coding region, and the production of a protein that isn’t functional. Because of this, the SMN2 gene can’t compensate for the malfunctioning SMN1 gene. The normal SMN1 protein is required for spliceosome activity. So essentially, a mutation in one gene leads to problems with overall splicing of messenger RNAs, and this could be overcome except for an independent splicing issue with a potentially compensatory gene.
Manipulating splicing for therapeutic gain
As we saw in Chapter 7, in Duchenne muscular dystrophy, the severe muscle wasting disease carried on the X chromosome, the dystrophin gene is mutated (see page 92). This gene is exceptionally large, stretching for almost 2.5 million base pairs. It contains nearly 80 amino acid-coding regions, which need to be spliced and processed properly. This is particularly important because the dystrophin protein is long-lasting. This means that any change that increases the chance of mis-splicing will affect the cell for a long time. But the presence of 78 introns in this massive gene means there is a high risk of spontaneous and inherited mutations which could affect splicing, just because there are so many opportunities for this to happen. As one review rather pithily puts it: ‘The massive (2.4Mb) dystrophin gene, most of which is in its 78 introns, is a splicing accident waiting to happen, and it does so with an incidence of 1 in 3,000 live births.’20
So, some cases of Duchenne muscular dystrophy are caused by splicing defects. However, in a large number of cases the disease is caused because critical regions of the gene, and hence the protein, are missing. But in recent years there has been a glimmer of hope for developing treatments for this invariably fatal disease. Perhaps counterintuitively, this has been based on developing drugs which encourage abnormal splicing in the dystrophin gene in affected boys.
The dystrophin protein acts like a kind of shock absorber in muscle cells. We can think of the dystrophin molecules as being like the springs in a mattress. In order for the mattress to remain supportive, the springs need to attach to the top and bottom of the mattress. If there has been a manufacturing fault, and the springs have been produced without the last ten centimetres, they won’t be able to attach to the top of the mattress. The more often you use the mattress, the less supportive and more distorted it will become.
It is quite common that Duchenne muscular dystrophy is caused by loss of internal regions of the dystrophin gene. When the gene is copied into RNA, the remaining regions are spliced together. Compared with the normal dystrophin protein, the protein made from the mutant gene is now lacking some amino acids in its interior. But that’s not what really causes the biggest problem, as shown in Figure 17.6.
The amino acid code is read in blocks of three bases, as we have already seen. When the correct amino acid-coding regions (known as exons) join together, as in the normal gene, they produce a long messenger RNA molecule that codes for lots of amino acids. But if wrong exons join together, they may be out of sync with each other, so that the blocks of three don’t run properly. A really simple example would be the following:
YOU MAY NOT SEE THE END BUT TRY
If we lose one letter, we rapidly start to lose the sense:
YOU MAY OTS EET HEE NDB UTT RY
This is called a frame shift. In messenger RNA the first impact is that the wrong amino acids are inserted into the growing protein chain. But quite soon, something even more dramatic happens. There will be a combination of three letters that act as a stop signal. At that point the ribosomes will cease adding amino acids, and the mutated protein is truncated.
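The sentence example can be reproduced mechanically. The sketch below strips the spaces, removes one letter, and reads the result in blocks of three again. The function name is hypothetical.

```python
def read_in_threes(message: str) -> str:
    """Strip spaces, then re-split the letters into blocks of three."""
    letters = message.replace(" ", "")
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


sentence = "YOU MAY NOT SEE THE END BUT TRY"
letters = sentence.replace(" ", "")

# Delete the seventh letter (the N of NOT), as in the example above.
shifted = letters[:6] + letters[7:]

print(read_in_threes(letters))  # YOU MAY NOT SEE THE END BUT TRY
print(read_in_threes(shifted))  # YOU MAY OTS EET HEE NDB UTT RY
```

Everything before the missing letter still reads correctly. Everything after it is shifted by one place, which is exactly what happens to the three-base blocks in a messenger RNA.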
This is what happens in the patients with deletions of certain regions of the dystrophin gene. In Figure 17.6, the frame of reading the three-base combination is indicated by the numbers under the boxes. As long as the number at the end of one box and the beginning of the next add up to three, the ribosome can keep reading the messenger RNA. But where the most common deletion occurs, this introduces a frame shift, which rapidly leads to a stop signal and a severely shortened protein chain.
One way around this would be to encourage the cell to skip one of the amino acid-coding regions after the deletion, because this would restore everything to the right reading frame. The end result would be a protein that has a bit missing internally, but can still function reasonably well. This might slow down progression of symptoms. This is shown using our bed spring analogy, in Figure 17.7. The dystrophin molecule will still be able to connect to the necessary proteins at either end. It won’t be quite as good at absorbing shocks as the full-length protein. But it will be a lot better than one that can’t tether to the necessary cellular structures.
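The rule behind Figure 17.6 comes down to simple arithmetic: the reading frame survives only if the number of bases removed is a multiple of three. The sketch below uses assumed lengths for regions 47 to 52. They are illustrative values chosen for this example, not figures given in the text, and should be checked against a reference annotation before being relied on.

```python
# Lengths in bases for regions 47-52. These are illustrative values assumed
# for this sketch, not figures taken from the text.
EXON_LENGTHS = {47: 150, 48: 186, 49: 102, 50: 109, 51: 233, 52: 118}


def stays_in_frame(missing: set[int]) -> bool:
    """The downstream reading frame is preserved only if the total number of
    bases removed is a multiple of three."""
    removed = sum(EXON_LENGTHS[region] for region in missing)
    return removed % 3 == 0


print(stays_in_frame(set()))             # True: nothing missing
print(stays_in_frame({48, 49, 50}))      # False: deletion shifts the frame
print(stays_in_frame({48, 49, 50, 51}))  # True: also skipping 51 restores it
```

With these assumed lengths, losing regions 48 to 50 removes 397 bases, one more than a multiple of three. Skipping region 51 as well brings the total to 630, and the blocks of three line up again.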
image
Figure 17.6 Representation of a key region where a mutation in the dystrophin gene can lead to a severely shortened protein molecule because of a shift in amino acid reading patterns when amino acid-coding regions 48 to 50 are lost from the DNA. In order to maintain the reading pattern, the numbers under each boundary must add up to three. If region 51 could be skipped in the mutant gene, the reading sequence would be restored. For simplicity, all the amino acid-coding regions have been drawn as the same size, although in reality they vary from one another.
image
Figure 17.7 Schematic showing how a mutant dystrophin protein is unable to attach to the two sides of the cell membrane. The skipped version of the mutant protein, which is missing some internal sequences, can attach to the two sides of the membrane. Because it is shorter it isn’t as good a shock absorber as the normal protein, but it is much better than the original mutant.
The supporting evidence for this hypothesis looked good, and biotech companies began programmes to try to find a way of exploiting this knowledge. A company called Prosensa developed a drug that helped muscle cells to skip over amino acid-coding region 51 and eventually licensed this experimental drug to the pharmaceutical giant GlaxoSmithKline. In April 2013, GlaxoSmithKline published the results of a small trial in boys with the relevant form of Duchenne muscular dystrophy. Fifty-three boys were randomly assigned to two groups. One group received the drug, the other went through exactly the same procedures, but without actually receiving any of the experimental drug. This is known as a placebo procedure and is an important way of controlling for effects in clinical trials that are due not to the drug but to other effects such as increased optimism or patients getting better independently of the drug. The boys were tested 24 and 48 weeks later. The test measured how far they could walk in six minutes.
After 24 weeks, the boys who received placebo had got worse, as we would expect for this disease. They couldn’t walk as far as when they entered the trial. But the boys who received the drug could walk more than 30 metres further than when the trial started. The boys were tested again after 48 weeks. The placebo group had deteriorated even more. In the six-minute walking test, their performance was almost 25 metres less than when the trial started. The boys who had been treated could walk over eleven metres further than when the programme began.21
These data showed that, over time, even the boys who received the drug began to decline (look at the difference between 24 and 48 weeks), but this decline was dramatically slower than when the condition was running its normal course.
The results from this trial caused enormous excitement. Finally, it looked like there might be hope developing in the treatment of a previously intractable disorder. Even if the treatment didn’t cure the patients, it might significantly slow down the development of the irreversible symptoms. This was what everyone researching in the field, and the families of affected boys, had been working towards for decades. True, it wouldn’t work for all Duchenne sufferers, but between 10 and 15 per cent of patients were expected to be eligible for this approach, based on the kind of mutation in their dystrophin gene.
Just six months later, those hopes were in tatters. GlaxoSmithKline ran a larger trial and this time couldn’t find any significant difference between the treated and untreated groups.22 The results from larger trials are more reliable than ones from smaller studies because they are less likely to be affected by odd patterns that look like a response but aren’t. GlaxoSmithKline had no doubts about its large trial, convinced that if there had been a genuine effect of the drug, it would have been detected. They handed the drug back to Prosensa and walked away. Prosensa is continuing with clinical studies, although its share price tanked after GlaxoSmithKline departed, reflecting concerns by analysts that this programme may be doomed.
There is another company that is also trying to exploit splicing patterns to leap over the troublesome region in the dystrophin gene in the same patient groups. This company is called Sarepta, and it is using a similar approach to treating the affected boys. Although the company remains very upbeat about its programme, the Food and Drug Administration has questioned whether its trials are large enough to give genuinely conclusive results. For example, one of the studies in which a dramatic difference between the untreated and treated groups was seen only contained twelve patients.
Investors in the companies are no doubt feeling a chill breeze, but it can’t begin to compare with what the families of affected boys must have gone through and be going through every day.
It would be tempting to look at the science in this chapter and decide that splicing is more trouble than it’s worth. It certainly seems to be an example of Sod’s law – if something can go wrong, it will. But the reality is that the same is true of almost every biological process. Billions of bases, thousands of genes, trillions of cells, billions of people. It’s a numbers game; nothing goes right every time. But the fact that this process of joining together split genes has been maintained through hundreds of millions of years of evolutionary history, using a highly conserved system, makes it pretty clear that the advantages of the sophistication, additional information content and sheer flexibility more than compensate for the off days.
* If you haven’t already, go and see The Lego Movie, it’s excellent.
* As a reminder, the junk regions between amino acid-coding regions are known as introns. The amino acid-coding parts themselves are known as exons.