If you ever find yourself far from city lights, on a cloudless night with no moon, grab a blanket and lie on the ground and look up at the stars. It’s one of the most wonderful sights imaginable, and quite breathtaking for anyone who spends their life in a city. The glints of silver in the dark blanket of the heavens seem too many to count.
But if you have access to a telescope, you realise that there is so much more in the firmament than you can detect with the naked eye. There are details like the rings of Saturn, and there are vastly more stars than we could ever imagine. There is so much more in the apparent darkness of the universe than can be seen with our limited unaided vision. This becomes even more obvious if we use equipment that can detect energy in other parts of the electromagnetic spectrum, beyond the visible wavelengths. More information keeps pouring in, from gamma rays to the cosmic microwave background. Those details and those stars have always been there; we just couldn’t detect them when we relied only on eyesight.
In 2012, a whole slew of papers was published that attempted to turn a telescope onto the furthest reaches of the human genome. This was the work of the ENCODE consortium, a collaborative effort involving hundreds of scientists from multiple different institutions. ENCODE is an acronym derived from Encyclopaedia Of DNA Elements.1 Using the most sensitive techniques available, the researchers probed multiple features of the human genome, analysing nearly 150 different cell types. They integrated the data in a consistent way, so that they could compare the outputs from the different techniques. This was important because it’s very difficult to make comparisons between data sets that have been generated and analysed differently from each other. Such piecemeal data were what we had previously relied on.
When the ENCODE data were published, there was an enormous amount of attention from the media, and from other researchers. Press coverage included headlines such as ‘Breakthrough study overturns theory of “junk DNA” in genome’;2 ‘DNA project interprets “Book of Life”’3 and ‘Worldwide army of scientists cracks the “junk DNA” code’.4 We might imagine that other scientists would all be congratulatory, and even grateful for all the additional data. And a lot were really fascinated, and are using the data every day in their labs. But the acclaim has been far from universal. Criticism has come mainly from two camps. The first is the junk sceptics. The second is the evolutionary theorists.
To understand why the first group were upset, we can examine one of the pithiest statements in the ENCODE papers:
These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.5
In other words, instead of a mainly dark sky with less than 2 per cent of the space occupied by stars, ENCODE was claiming that four-fifths of our genome’s celestial canopy is filled with objects. Most of those objects aren’t stars, if we take stars to represent the protein-coding genes. Instead, they could be asteroids, planets, meteors, moons, comets and any other celestial objects you can think of.
As we have seen, many research groups had already assigned functions to some of the dark area, including promoters, enhancers, telomeres, centromeres and long non-coding RNAs. So most scientists were comfortable with the idea that there was more to our genome than the small proportion that encoded proteins. But 80 per cent of the genome having a function? That was a really bold claim.
Although startling, these data had been foreshadowed by indirect analyses carried out in the previous decade by scientists trying to understand why humans are so complicated. This was the problem that had puzzled so many people ever since the completed human genome sequence failed to reveal substantially more protein-coding genes in humans than in much simpler organisms. Researchers analysed the size of the protein-coding part of the genome in different members of the animal kingdom, and also the percentage of the overall genome that was junk. The results, which we touched on in Chapter 3, are shown in Figure 14.1.
Figure 14.1 Graphical representation showing that organismal complexity scales more with the proportion of junk DNA in a genome than with the size of the protein-coding part of the same genome.
As we have seen, the amount of genetic material that codes for proteins doesn’t scale very well with complexity. There is a much more convincing relationship between the percentage of junk in the genome and how complicated an organism is. This was interpreted by the researchers as suggesting that the difference between simple and complex creatures is mainly driven by junk DNA. This in turn would have to imply that a significant fraction of the junk DNA has function.6
Multiple parameters
ENCODE calculated its figures for the level of function in our genome by combining all sorts of data. These included information on the RNA molecules that the researchers detected, both protein-coding RNAs and ones that don’t code for protein, i.e. junk RNAs. These ranged in size from thousands of bases down to molecules a hundred times smaller. ENCODE also defined genome regions as functional if they carried particular combinations of epigenetic modifications that are usually associated with functional regions. Other methodologies involved analysing regions that looped together in the way we encountered in the previous chapter. Yet another technique was to characterise the genome in terms of specific physical features associated with function.*
These features varied across the different human cell types analysed, reinforcing the concept that there is a great deal of plasticity in how cells can use the same genomic information. For example, analyses of looping found that any one specific interaction between different regions was only detected in one out of three cell types.7 This suggests that the complex three-dimensional folding of our genetic material is a sophisticated, cell-specific phenomenon.
When looking at the physical characteristics that are typically associated with regulatory regions, researchers concluded that these regulatory DNA regions are also activated in a cell-dependent manner, and in turn that this junk DNA shapes cell identity.8 This conclusion was reached after the scientists identified nearly 3 million such sites from analysis of 125 different cell types. This doesn’t mean that there were 3 million sites in each cell type. It means that 3 million were detected when the different sites from each cell type were added up. Yet again, this suggests that the regulatory potential of the genome can be used in different ways, depending on the needs of a specific cell. The distribution of the sites among different cell types is shown in Figure 14.2.
Over 90 per cent of the regulatory regions identified by this method were more than 2,500 base pairs away from the start of the nearest gene. Sometimes they were far from any gene at all; in other cases they were in a junk region within a gene body, but still far from its beginning.
Figure 14.2 Researchers analysing the ENCODE data sets identified over 3 million sites with the characteristics of regulatory regions, when they assessed multiple human cell lines. The areas of the circles in this diagram represent the distribution of these sites. The majority were found in two or more cell types, although a large fraction was also specific to individual cell types. Only a very small percentage were found in every cell line that was analysed.
Most gene promoters were associated with more than one of these regions, and each region was typically associated with more than one promoter. Yet again, it appears that our cells don’t use straight lines to control gene expression; they use complex networks of interacting nodes.
Some of the most striking data suggested that over 75 per cent of the genome was copied into RNA at some point in some cells.9 This was quite remarkable. No one had ever anticipated that nearly three-quarters of the junk DNA in our cells would actually be used to make RNA. When they compared protein-coding messenger RNAs with long non-coding RNAs the researchers found a major difference in the patterns of expression. In the fifteen cell lines they studied, protein-coding messenger RNAs were much more likely to be expressed in all cell lines than the long non-coding RNAs, as shown in Figure 14.3. The conclusion they reached from this finding was that long non-coding RNAs are critically important in regulating cell fate.
Taken in their entirety, data in the various papers from the ENCODE consortium painted a picture of a very active human genome, with extraordinarily complex patterns of cross-talk and interactivity. Essentially the junk DNA is crammed full of information and instructions. It’s worth repeating the hypothetical stage directions from the Introduction: ‘If performing Hamlet in Vancouver and The Tempest in Perth, then put the stress on the fourth syllable of this line of Macbeth. Unless there’s an amateur production of Richard III in Mombasa and it’s raining in Quito.’10
Figure 14.3 Expression of protein-coding and non-coding genes was analysed in fifteen different cell types. Protein-coding genes were much more likely to be expressed in all cell types than was the case for regions that produced non-coding RNA molecules.
This all sounds very exciting, so why was there a considerable degree of scepticism about how significant these data really were? Part of the reason is that the ENCODE papers made such large claims about the genome, particularly the statement that 80 per cent of the human genome is functional. The problem is that some of these claims are based on indirect measures of function. This was especially true for the studies where function was inferred either from the presence of epigenetic modifications or from other physical characteristics of the DNA and its associated proteins.
Potential versus actual
The sceptics argue that at best these data indicate a potential for a region to be functional, and that this is too vague to be useful. An analogy might help here. Imagine an enormous mansion, but one where the owners have fallen on hard times and the power has been disconnected. Think Downton Abbey in the hands of a very bad gambler. There could be 200 rooms and five light switches in each room. Each switch could potentially turn on a bulb, but it may be that some of the switches were never wired up (aristocrats are not known for their electrical talents), or that the associated bulb is broken. The fact that the switches are on the walls and can be flicked between the on and off positions doesn’t actually tell us that they will make any difference to the level of light in the room.
The same situation may take place in our genomes. There may be regions that carry epigenetic modifications, or have specific physical characteristics. But this isn’t enough to demonstrate that they are functional. These characteristics may simply have developed as a side effect of something that happened nearby.
Look at any photo of Jackson Pollock creating one of his abstract expressionist masterpieces.11 It’s a pretty safe bet that the floor of his studio got spattered with paint as he created his canvases. But that doesn’t mean that the paint spatters on the floor were part of the painting, or that the artist endowed them with any meaning. They were just an inevitable and unimportant by-product of the main event. The same may be true of physical changes to our DNA.
Another reason why some observers have been sceptical about the claims made in ENCODE is based on the sensitivity of the techniques used. The researchers were able to use methods that were far more sensitive than those employed when we first started exploring the genome. This allowed them to detect very small amounts of RNA. Critics are afraid that the techniques are too sensitive and that we are detecting background noise from the genome. If you are old enough to remember audio tapes, think back to what happened if you turned the volume on your tape deck really high. You usually heard a hissing noise behind the music. But this wasn’t a sound that had been laid down as part of the music, it was just an inevitable by-product of the technical limitations of the recording medium. Critics of ENCODE believe that a similar phenomenon may also occur in cells, with a small degree of leaky expression of random RNA molecules from active regions of the genome. In this model, the cell isn’t actively switching on these RNAs, they are just being copied accidentally and at very low levels because there is a lot of copying happening in the neighbourhood. A rising tide lifts all boats, but it will also raise any old bits of wood and abandoned plastic bottles that happen to be in the water as well.
This seems like quite a significant problem when we realise that in some cases the researchers detected, on average, less than one molecule of a particular RNA per cell. It isn’t possible for a cell to express somewhere between zero copies and one copy of an RNA molecule. A single cell either produces no copies of that particular RNA or it makes one or more copies. Anything else is like being ‘sort of’ pregnant. You’re either pregnant or you’re not; there’s nothing in between.
But this doesn’t actually tell us that the techniques used were too sensitive. Instead, it tells us that our techniques still aren’t sensitive enough. Our methods aren’t good enough to allow us to isolate single cells and analyse all the RNA molecules in that cell. Instead we have to rely on isolating multiple cells, analysing all the RNA molecules and then calculating how many molecules were present on average in the cells.
The problem with this is that it means we can’t distinguish between a large percentage of the cells in a sample each expressing a small amount of a specific RNA and a small percentage of the cells each expressing a large amount of the RNA. These different scenarios are shown in Figure 14.4.
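For readers who like to see the arithmetic, here is a minimal sketch in Python of why a pooled measurement is so ambiguous. The cell numbers and copy numbers are the invented ones from Figure 14.4, not real ENCODE measurements.

```python
# Two hypothetical populations of 36 cells, echoing Figure 14.4.
# Population A: every cell makes 2 copies of a given RNA.
# Population B: 2 cells make 36 copies each; the other 34 make none.

population_a = [2] * 36
population_b = [36] * 2 + [0] * 34

def bulk_average(cells):
    """Return the per-cell average a pooled (bulk) measurement would report."""
    return sum(cells) / len(cells)

# Both pools contain 72 molecules in total, so the bulk averages are identical
# even though the underlying biology is completely different.
print(bulk_average(population_a))  # 2.0
print(bulk_average(population_b))  # 2.0
```

Because only the total (or the average) is visible to the researcher, the two very different scenarios cannot be told apart.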
The other problem is that we have to kill the cells in order to analyse the RNA molecules. As a consequence we only generate snapshots of RNA expression, whereas ideally we would want the equivalent of a movie, so that we could see what was happening to RNA expression in real time. The problem inherent in this is demonstrated in Figure 14.5.
Ideally, of course, we should be able to test whether the findings of ENCODE really stand up to scrutiny by direct experiments. But this takes us back to the problem that there were so many findings. How do we decide on candidate regions or RNA molecules to study? The additional complication is that many of the features identified in the ENCODE papers are parts of large, complex networks of interactions. Each component on its own may have only a limited effect on the overall picture. After all, if you cut through one knot in a fishing net, you won’t destroy its overall function. The hole may allow the occasional fish to escape, but losing one small fish probably won’t have too much impact on your catch. Yet that doesn’t mean all the knots are unimportant. They are all important, because they work together.
Figure 14.4 Each small square represents an individual cell. The figures inside each cell indicate the number of molecules of a specific RNA produced in that cell. The researcher analyses a batch of cells, because of the sensitivity limits of the detection methods available. This means that the researcher only has access to the total number of molecules in the batch and cannot distinguish between (left) 36 cells all containing two molecules and (right) two cells (out of 36) each containing 36 molecules, or any other combination that results in an overall total of 72 molecules.
Figure 14.5 Expression of a specific RNA in a cell may follow a cyclical pattern. The boxes represent the points at which a researcher samples the cells to measure expression of that RNA molecule. The results may appear very different when comparing different batches of cells, perhaps from discrete tissues, but it may be that this simply reflects a temporal fluctuation rather than a genuine, biologically significant variation.
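The snapshot problem in Figure 14.5 can be illustrated in the same way. In the sketch below the oscillating expression pattern and the sampling times are entirely made up; the point is simply that two snapshots of identical biology can look completely different.

```python
import math

# Hypothetical cyclical expression: copies of an RNA oscillate over a 24-unit cycle.
def expression_level(t):
    """Toy model: expression oscillates between 0 and 10 copies per cell."""
    return 5 + 5 * math.sin(2 * math.pi * t / 24)

# Two researchers sample the same kind of cells, but at different points in the cycle.
snapshot_1 = expression_level(6)   # near the peak of the cycle
snapshot_2 = expression_level(18)  # near the trough of the cycle

# The snapshots look dramatically different, yet nothing about the underlying
# biology differs; only the timing of the measurement has changed.
print(round(snapshot_1, 1))  # 10.0
print(round(snapshot_2, 1))  # 0.0
```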
The evolutionary battleground
The authors of the ENCODE papers, and of the accompanying commentaries, also used the data to draw evolutionary conclusions about the human genome. Part of the reason for this lay in an apparent mismatch. If 80 per cent of the human genome has function, you would predict that there should be a significant degree of similarity between the human genome and at the very least the genomes of other mammals. The problem is that only about 5 per cent of the human genome is conserved across the mammalian class, and the conserved regions are highly biased towards the protein-coding entities.12 In order to address this apparent inconsistency, the authors speculated that the regulatory regions have evolved very recently, and mainly in primates. Using data from a large-scale study of DNA sequence variation in different human populations, they concluded that the regulatory regions have relatively low diversity in humans, whereas the diversity is much higher in areas that have no activity at all. One of the commentaries explored this further, using the following argument. Protein-coding sequences are highly conserved in evolution because a particular protein is often used in more than one tissue or cell type. If the protein changed in sequence, the altered protein might function better in a particular tissue. But that same change might have a really damaging effect in another tissue that relies on the same protein. This acts as an evolutionary pressure that maintains protein sequence.
But regulatory RNAs, which don’t code for proteins, tend to be more tissue-specific. Therefore they are under less evolutionary pressure because only one tissue relies on a regulatory RNA, and possibly only during certain periods of life or in response to certain environmental changes. This has removed the evolutionary brakes on the regulatory RNAs and allowed us to diverge from our mammalian cousins in these regions. But across human populations, there has been pressure from evolution to maintain the optimal sequence for these regulatory RNAs.13
Biologists tend to be a rather restrained social group when it comes to disagreements. There’s the occasional aggressive question-and-answer session at a conference, but generally public pronouncements are carefully phrased. This is especially true of anything that is published, rather than said at a meeting. We all know how to read between the lines, of course, as shown in Figure 14.6. That’s what made the debate that followed ENCODE particularly entertaining to the relatively disinterested observer.
The most forthright responses were mainly from evolutionary biologists. This wasn’t altogether surprising. Evolution is the biological discipline where emotions tend to run highest. Normally the bullets are targeted at creationists, but the Gatling guns may also be turned on other scientists. Epigeneticists working on the transmission of acquired characteristics from parent to offspring were probably quite relieved that ENCODE took them out of the firing line for a while.14
The angriest critique of ENCODE included the expressions ‘logical fallacy’, ‘absurd conclusion’, ‘playing fast and loose’ and ‘used the wrong definition wrongly’. Just in case we were still in doubt about their direction of travel, the authors concluded their paper with the following damning blast:
Figure 14.6 Scientists are usually outwardly polite (left-hand statements), but are sometimes just speaking in barely disguised code (right-hand thoughts) …
The ENCODE results were predicted by one of its lead authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass media hype, and public relations may well have to be rewritten.15
The main criticisms from this counter-blast centred on the definition of function, the way the ENCODE authors analysed their data, and the conclusions drawn about evolutionary pressures. The first of these relates to the problems we have already described, using our Jackson Pollock and Downton Abbey analogies. In some ways, these problems derive in large part from difficulties in separating mathematics from biology. The ENCODE data sets were predominantly interpreted by the original authors through statistical and mathematical approaches. The sceptics argue that this leads us down a blind alley, because it doesn’t take into account biological relationships, and these are critically important. They use a very helpful analogy to explain this. The reason the heart is important is that it pumps blood around the body. That’s the biologically important relationship. But if we analysed the actions of the heart using only a mathematically derived map of its interactions, we could draw some ridiculous conclusions. These could include the conclusions that the heart is present in order to add weight to the body, and to produce the sound ‘lub-dub’. These are both things that the heart undoubtedly does, but they are not its function. They are merely contingent on its genuine role.
The authors also criticised the analytical methods, because they felt that the ENCODE teams had not been consistent in the way they applied their algorithms. One consequence of this was that effects seen in a large region might carry inappropriate weight in an analysis. For example, if a block of 600 base pairs was classified as functional when all the work was actually carried out by just ten of them, this would dramatically skew the percentage of the genome designated as having a function.
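To put rough numbers on that criticism, here is a small illustrative calculation in Python using the hypothetical figures above (a 600 base pair block in which only ten bases actually do the work).

```python
# Hypothetical example of how block-level annotation can inflate estimates.
block_size = 600          # base pairs classified as 'functional' in one go
truly_functional = 10     # base pairs actually doing the work

# Fraction of the block reported as functional under each accounting scheme.
block_level_estimate = block_size / block_size        # 100 per cent
base_level_estimate = truly_functional / block_size   # about 1.7 per cent

inflation = block_level_estimate / base_level_estimate
print(f"Block-level accounting overstates function {inflation:.0f}-fold")  # 60-fold
```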
The evolutionary argument was that the ENCODE authors ignored the standard model, in which regions showing large amounts of variation reflect a lack of evolutionary selection, which in turn means they are relatively unimportant. If you want to overturn such a long-held principle, you need very strong grounds for doing so. But the critics claimed that the ENCODE papers, although containing huge amounts of data, had focused on an inappropriately small number of regions when drawing evolutionary conclusions from the sequences of humans and other primates.
There are interesting scientific arguments on both sides, but it would be disingenuous to believe that the amount of heat and emotion generated by ENCODE has been purely about the science. We can’t ignore other, very human factors. ENCODE was an example of Big Science. These are typically huge collaborations costing millions and millions of dollars. The science budget is not infinite and when funds are used for these Big Science initiatives, there is less money to go around for the smaller, more hypothesis-driven research.
Funding agencies work hard to get the balance right between the two types of research. In many cases, Big Science is funded if it generates a resource that will stimulate a great deal of other science. The original sequencing of the human genome would be a clear example of this, although we should recognise that even that was not without its critics. But with ENCODE the controversy is not around the raw data that were generated, it’s about how those data are interpreted. That makes it different from a pure infrastructure investment in the eyes of the critics.
When all stages and aspects of ENCODE are added up, it cost in the region of a quarter of a billion dollars. The same amount of money could have funded at least 600 average-sized individual research grants, each focused on investigating a specific hypothesis. Choosing how to distribute funding is a balancing act, and at these levels of spending it is guaranteed to create division and concern.
A company called Gartner created a graphic that shows how new technologies are perceived over time. It is known as the Hype Cycle. At first everyone is very excited: ‘the peak of inflated expectations’. When the new tech fails to transform everything about your life, there is a crash leading to the ‘trough of disillusionment’. Eventually everyone settles down, there is a steady growth in rational understanding, and finally a productive plateau is reached.
With something like ENCODE this cycle is extraordinarily compressed, because of the polarisation from the most vocal groups. Those scientists with inflated expectations are operating at exactly the same time as those in the trough. Pretty much everyone else is pragmatic, and will use the data from ENCODE when it is useful to do so. Which is usually when it can help inform a specific question that an individual scientist finds interesting.