The Princeton Guide to Evolution

V.3

Comparative Genomics

Jason E. Stajich

OUTLINE

1. Comparative genomics and genome evolution

2. Evolution of gene number

3. Identifying regulatory regions

4. Copy number variation

5. Rapidly evolving regions

6. Ultraconserved elements

7. The future of comparative genomics

Genome sequencing now makes it possible to inventory the genetic material of most living and, in some cases, ancient organisms. To date hundreds of genomes have been sequenced from bacteria and single-cell eukaryotes, as well as animals, plants, and fungi. The vast majority of sequences have come from microbes, including Bacteria and Archaea from diverse environments including those living in hot springs, inside animal digestive tracts, in soil, and in disease-causing organisms. The fungi are the most sampled eukaryotes, but dozens of animals and plants, with their large genomes, have also been tackled. The pace of this sequencing is still accelerating, and it is conceivable that many thousands of species will be sequenced in the next decade, with similar numbers of genomes from multiple individuals or populations of species. By comparing the genome sequences of species, we are learning how the genetic inventory changes over time. Those genes or DNA loci that have been maintained among many species often encode important or essential functions for cells, while those that change rapidly may be superfluous or provide a needed function only in select species. The ability to inventory and compare genomes requires computational tools and models of evolutionary change.

GLOSSARY

Alternative Splicing. The production of multiple transcripts from a single gene locus by the alternative inclusion of partial or entire exons through variation in splicing of the RNA.

C-Value Paradox. The observation that genome size does not scale with organismal complexity, first noted when large genomes were identified in seemingly less complex organisms.

Negative Selection. Sometimes called purifying selection, it indicates that selection is purging changes that cause deleterious impacts on the fitness of the host.

Operon. A cluster of genes that function as a single unit under common regulatory control where the proteins often perform multiple steps in a common biochemical pathway.

Orthologue. A gene found in different species that evolved from a gene present in the common ancestor of the species.

Phylogenetic Shadowing. The identification of conserved blocks in sequences by aligning multiple sequences from different species to find regions of high conservation. Also called phylogenetic footprinting.

Positive Selection. Often called Darwinian selection, it describes selection where changes improve the fitness of the host.

Posttranscriptional Regulation. Regulation of the gene product after an mRNA is made. Mechanisms include premature degradation and delay or block of translation into protein.

Pseudogene. A gene locus that has accumulated mutations and is no longer functional.

Synteny. Shared gene order, where genes are arranged in the same order and orientation along chromosomes of different species, indicating they were likely in the same order in the common ancestor.

Transposable Element. DNA segments that can relocate themselves in the genome.

Ultraconserved Elements. Conserved genomic regions stretching over hundreds of bases that are highly similar over millions of years of evolutionary time.

1. COMPARATIVE GENOMICS AND GENOME EVOLUTION

Rates of Change and Molecular Evolution

Studying the process and consequences of genome evolution involves comparing the sequences of modern-day organisms and reconstructing a history of events. The bases for the approaches are rooted in molecular evolution theory, which provides models and tools for the study of changes at the DNA or protein sequence level (see chapter V.1). Comparisons of genomes can identify regions that change, regions that are static or change less, regions that are unique to a specific lineage, or regions that are rapidly evolving. For any given region that is shared among species, one can count the number of differences and similarities over a stretch of DNA sequence and compute a rate of change. An evaluation of the entire genome allows a total average rate to be computed; individual regions can then be classified as faster or slower than an average or background rate. This classification of fast- or slow-evolving regions is useful, because it indicates the types of pressure exerted on the region by natural selection. There are three main classifications: negative selection, indicating slow to no change; neutral evolution, representing the average or background level of change without the influence of selection; and positive selection, indicating faster than the background rate of change (see chapter V.14).
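The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a pairwise alignment supplied as two equal-length strings with "-" for gaps; the function names, the window size, and the 50 percent tolerance are hypothetical choices for the example, and the labels are descriptive only, not a statistical test for selection.

```python
def window_divergence(seq_a, seq_b, window):
    """Fraction of differing sites in consecutive, non-overlapping alignment windows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    rates = []
    for start in range(0, len(seq_a) - window + 1, window):
        # Ignore alignment columns that contain a gap in either sequence.
        pairs = [
            (a, b)
            for a, b in zip(seq_a[start:start + window], seq_b[start:start + window])
            if a != "-" and b != "-"
        ]
        if not pairs:
            rates.append(None)  # nothing comparable in this window
            continue
        rates.append(sum(a != b for a, b in pairs) / len(pairs))
    return rates


def classify_windows(rates, tolerance=0.5):
    """Label each window relative to the average (background) rate.

    The tolerance is an arbitrary illustrative cutoff, not a significance test.
    """
    observed = [r for r in rates if r is not None]
    if not observed:
        raise ValueError("no comparable windows")
    background = sum(observed) / len(observed)
    labels = []
    for r in rates:
        if r is None:
            labels.append("no data")
        elif r < background * (1 - tolerance):
            labels.append("slower")
        elif r > background * (1 + tolerance):
            labels.append("faster")
        else:
            labels.append("near background")
    return background, labels
```

In practice the background would be estimated from the whole genome or from putatively neutral sites, and the raw proportion of differences would be corrected for multiple substitutions at the same site.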

Negative selection limits the mutations that fix in a region, most often because changes cause a decrease in fitness. As a consequence, a genomic region will appear to be evolving more slowly than the genome average because most mutations that occur are purged (e.g., individuals with specific changes in the region die or are less likely to reproduce). For example, a region that encodes an essential protein, such as a histone, would be expected to evolve slowly because any mutation that disrupted its function would render the organism inviable. This classification of negative selection is computed from alignments of a genomic region among multiple organisms. Regions under negative selection will typically have many fewer changes than are observed elsewhere in the genome. Regions with almost no changes between species are deemed ultraconserved and are discussed later in the chapter.

Regions that are evolving near the genome-wide average are considered to be evolving neutrally, and they typically do not encode functional elements on which natural selection would act. Examples include pseudogenes, which are inactivated or dead genes, and inactivated transposable elements. Pseudogenes that formed before a species split are particularly useful in trying to date the evolutionary divergence between organisms, because sequences are similar enough to align so the differences can be counted; since they are evolving neutrally, pseudogenes will have accumulated mutations at a rate proportional to the amount of time since the divergence of the species.
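As a rough illustration of dating with neutral sequence, the sketch below converts the observed proportion of differing sites between two aligned pseudogene copies into a divergence time. It assumes the Jukes-Cantor substitution model and a known, constant neutral mutation rate per site per year; neither assumption comes from this chapter, and the function names are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site from the observed proportion of differing sites p.

    Corrects for multiple substitutions at the same site under the
    Jukes-Cantor model (equal base frequencies, equal substitution rates).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(p, mu_per_site_per_year):
    """Years since two lineages split, assuming neutral evolution at rate mu.

    The factor of 2 reflects that substitutions accumulate independently
    along both lineages after the split.
    """
    return jukes_cantor_distance(p) / (2 * mu_per_site_per_year)
```

For example, 10 percent observed divergence and an assumed rate of 1e-9 substitutions per site per year would give an estimate of roughly 54 million years under this model.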

Genomic regions that are changing more rapidly than the background rate are considered to be under positive selection. It is not that these regions are experiencing a higher mutation rate than the rest of the genome but that mutations in the region are more likely to persist and be passed to the next generation because of the fitness advantage associated with them. Thus, these regions appear to change at a rate higher than the background rate for the genome. Since not every mutation that occurs will be fixed, those in neutrally evolving regions will be lost by genetic drift at a rate proportional to the effective population size of the species. But if a beneficial mutation occurs in a region and presents a fitness improvement that natural selection can act on, the change will be passed on to the next generation at a higher rate than neutral mutations, thus producing a higher rate of change overall. Positive selection is easier to detect and interpret in protein-coding genes, where its impact is seen as a higher frequency of amino-acid-changing mutations relative to mutations that do not change amino acids, because mutations that change amino acids alter the sequence and potentially the function of the protein. For genomic regions that do not code for proteins, the types of mutations that are beneficial are less well defined, but overall faster rates of fixation indicate a relative importance of the region for fitness.

Gene Content Comparison

Sequence comparisons and evidence for selection are not limited to the level of DNA nucleotides. The gene content of an organism can also be important for interpreting its past and present ecological niche and types of competition pressures. An overall inventory of the genes with a shared history from a group of species can be obtained. This is done by first identifying homologous gene sequences using sequence searching tools such as BLAST (Basic Local Alignment Search Tool) to find significantly similar sequences. The genes are clustered by the significance of the similarity to find groups likely to be descendants from a copy that was present in the common ancestor of the species. These groups of shared genes can be automatically constructed from the total gene sets of multiple organisms using sequence similarity and clustering approaches that vary from simple distance methods to graph-theoretic approaches and phylogenetic tree construction. The tools build an overall collection of genes shared among organisms, and the clusters can be compared to see how the content varies among species.
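The simplest of the graph-theoretic approaches mentioned above is single-linkage clustering: treat each significant similarity hit as an edge and report the connected components. The sketch below assumes hits have already been parsed into (query, subject, E-value) tuples; the function name, the tuple layout, and the 1e-5 cutoff are illustrative assumptions, and dedicated orthology tools use considerably more refined clustering.

```python
def cluster_genes(hits, max_evalue=1e-5):
    """Group genes into clusters by single linkage over significant hits.

    hits: iterable of (query_gene, subject_gene, evalue) tuples
          (a hypothetical, already-parsed similarity search result).
    Returns a list of sorted gene lists, one per cluster.
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        # Only hits at or below the cutoff join two genes into one cluster.
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])
```

Single linkage is easy to reason about but tends to merge distinct families that share a common domain, which is one reason more elaborate graph and tree-based methods are used in practice.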

With these collections of genes clustered into orthologous groups, one can investigate how genes are classified into particular phyletic patterns, those found in common among all species, or falling into groupings that represent particular hypotheses. For example, searching for genes found only in common among plant pathogens might provide ideas about genes involved in pathogenesis. The collection of genes shared among all or most species gives an idea about core processes that are essential for survival, since no organism has lost them.
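A phyletic pattern query of this kind reduces to set comparison once each cluster records which species contribute genes to it. The sketch below assumes gene identifiers of the form "species|gene", which is a naming convention invented for the example.

```python
def phyletic_pattern(cluster):
    """Set of species represented in a cluster of 'species|gene' identifiers."""
    return frozenset(gene.split("|", 1)[0] for gene in cluster)


def clusters_with_pattern(clusters, species):
    """Clusters found in every listed species and in no other species."""
    wanted = frozenset(species)
    return [cluster for cluster in clusters if phyletic_pattern(cluster) == wanted]
```

Passing the full species list returns the core set shared by all genomes, while passing a subset (for example, only the plant pathogens in a data set) returns clusters restricted to that group.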

Lineage-specific genes can be markers for specific processes unique to an organism (see chapter V.6); however, one must be cautioned not to overinterpret the set without some validation, as it is possible for some predicted genes to be false-positive gene annotations. Even though false positives are enriched in the lineage-specific gene data set, there will be interesting contents within the group. First, true lineage-specific genes may be rapidly evolving genes that have changed so fast as to lack identifiable sequence similarity with homologues in other species. In addition, novel genes may arise through other processes to create genes de novo even from random mutations. Lineage-specific genes may also be the result of transfer of a gene fragment from a genome parasite like a virus or transposable element.

The order of genes encoded on a chromosome can also contain information that can be compared among species. Shared gene order, or synteny, can be used to compare the content of genomes at an even finer scale than simple presence or absence of individual genes. The same gene order in different species typically indicates the genes were in the same order in the ancestor and would give further support to the idea that the genes are orthologous. In bacteria, shared gene order often indicates the related function of genes, since operons typically encode a cluster of genes in a common order to produce a common transcript. In some eukaryotes the existence of shared gene order often represents an ancient order that has not had sufficient time to degrade or indicates a region that is under selection to maintain the physical clustering of the genes. One example is the Hox gene cluster, which is a syntenic cluster found in animals that has persisted since the common ancestor of vertebrates and invertebrates.
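Shared gene order can be detected with a simple scan once genes have been assigned to orthologous groups. The sketch below assumes each chromosome is given as a list of orthologue identifiers in chromosomal order and that each identifier occurs at most once per genome; the function name and the minimum block size are hypothetical, and gene orientation (strand) is not modeled.

```python
def syntenic_blocks(order_a, order_b, min_genes=3):
    """Runs of genes that are adjacent and collinear in both gene orders.

    A run may be in the same direction in both genomes or inverted in one.
    """
    position_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []  # genes in the run being extended
    step = 0      # +1 for same direction, -1 for an inverted run

    for gene in order_a:
        if gene in position_b and current:
            delta = position_b[gene] - position_b[current[-1]]
            if delta in (1, -1) and (len(current) == 1 or delta == step):
                step = delta
                current.append(gene)
                continue
        # The run is broken: keep it if long enough, then start a new one.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in position_b else []

    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

Requiring strict adjacency is deliberately conservative; tools built for this purpose usually tolerate small gaps and local rearrangements within a block.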

2. EVOLUTION OF GENE NUMBER

Does having more genes mean a more complex organism? This was certainly the expectation as genome projects in model systems such as yeast, fly, worm, mustard weed, and even human sequencing were under way. A human gene count wager had even been initiated that varied in number from 15,000 to 80,000 genes based on the number of transcripts identified through mRNA sequencing; however, when the human genome was finished and annotated, it was clear that the number is closer to 23,000 genes and not much different from the number in the (seemingly) less complex worm or fly with roughly 20,000 genes. So how does gene number evolve, and how important is the gene count for predicting the capabilities of the organism? In short, the number of genes is probably less important than the diversity of ways that genes can be regulated (e.g., the temporal and spatial expression pattern), the number of interactions of the gene products, and the variation and modifications the transcripts can undergo as part of cellular processes.

Several caveats and definitions are required to fully explore the ideas of the way gene number evolves. First, a gene is a locus of DNA that encodes sequence that will be transcribed by an RNA polymerase. The produced RNA either encodes sequence for a protein, or the RNA itself will fold into a structure that allows it to do some work of the cell. The expectation that gene count correlates with complexity originates from the “one gene, one enzyme” hypothesis that led to the Nobel Prize for Beadle and Tatum. The idea was put forth based on experiments in the fungus Neurospora crassa and showed that for their studies of enzyme pathways, one gene locus encodes only one functional product. The observed diversity of functional gene products and enzymes found in many organisms suggests that the number of genes would scale with complexity of the organism; however, comparisons of an increasing number of genomes, sampling broadly from the diversity of eukaryotes, indicate a lack of overall correlation between gene count and organismal complexity. The idea that complexity did not scale with genome size had already been revealed by the C-value paradox (i.e., genome size does not scale with complexity), but it was surprising that even gene count did not scale as expected with complexity.

The source of the variation comes from the number and types of RNAs that can be made from the gene loci to be translated into protein or folded into regulatory modules. The number of these transcripts is controlled by the gene locus, but also by alternative splicing and posttranscriptional regulation of these products, which injects additional variation into the system. For example, it has been shown that alternative splicing is an important aspect driving diversity of transcripts in organisms classified as complex; nearly 94 percent of the genes in the human genome are alternatively spliced, while fewer than 1 percent are alternatively spliced in the yeast Saccharomyces cerevisiae. Additional regulation also takes place at the transcriptional level, where transcription factors and chromatin modifications can precisely regulate when an RNA is made or whether or not it is available for immediate translation or folding into a functional protein (see chapter V.7). Simply put, the complexity comes not from the building blocks (genes) alone but also from the number of ways they can be modified and made to interact, which controls the variable types of structures and behaviors in organisms we consider more complex.

It is worth noting that complexity is a loaded term that tends to assume that what we see as larger or more successful at colonizing the earth are in fact the more complex organisms. However, other measures, such as the total number of diverse tissue types, developmental stages, and elaborateness of body systems, are more objective measures that can be applied without anthropomorphic biases.

Despite these caveats, studying gene count differences is still useful for exploring recent changes among close organisms. Because increasing the copy number of a gene in the genome is one route to production of more of the gene product, gene family expansions can be successful routes to adaptation to an environment when more of something is needed. For example, drug resistance can evolve in microbes through the increase in copy number of pumps that shuttle the drug out of the cell. Studies of overall counts and examples of recent duplications can thus provide indications of recent changes that are important for adaptation to a new environment.

3. IDENTIFYING REGULATORY REGIONS

In addition to studies of the regions of protein-coding genes, comparative genomics can be useful for looking at genomic conservation around the coding regions of genes to identify regulatory regions. A challenge is that regulatory regions do not evolve under the same rules as the protein-coding part of a gene locus. The main features of regulatory regions are binding sites for transcription factors, which are proteins that can activate or repress transcription of the gene (see chapter V.8). A mechanism for identifying regulatory regions from comparative genomics utilizes the premise that regions that evolve more slowly than the background or neutral evolutionary rate are likely to encode a region that provides a function. Identifying regulatory elements is more difficult than gene regions because it may be that the specific order or orientation of the binding sites is not important, simply the presence of one or more of them.

Orthologous versus Coregulated Genes

An important aspect of the analysis requires choosing an appropriate collection of sequences for comparison. One approach is to compare the regulatory regions of orthologous genes in different species. A reasonable assumption is that when a gene is regulated for liver development in humans, it might have been under the same control in the ancestor of both humans and mice. Thus, there may be common binding sites found among species when comparing the same region in an orthologous gene. Gathering orthologues and searching for motifs found in common among the regulatory regions, or performing alignments on these regions, can identify elements that are in common and have been preserved among the gene copies. This approach may be successful for genes that have not had a change in gene regulation since divergence from a common ancestor. This orthology-focused comparison is useful for studying slowly evolving regulatory patterns that are probably essential or inflexible to change. It is unclear what proportion of genes in any given genome these encompass. There are examples of very conserved processes, such as mating in the yeasts of the Saccharomyces group, in which the regulatory modules have completely changed even though the same protein components are all involved.

If gene regulation components have changed, there are additional approaches that can be applied to identify regulatory motifs among genes from a single organism having a common regulatory pattern. The assumption here is that similarly regulated genes share a common collection of regulatory modules. The inference of gene regulation pattern can be made through comparing changes in gene expression over time during developmental processes or in response to various changes in conditions (e.g., temperature, nutrients). By gathering genes with a common gene expression pattern, one can search for shared sequence motifs that may explain their similar regulation. This approach focuses on motif identification in only one species, but if multiple parallel studies are performed in different species, a comparison of types of regulation can be elucidated.

Alignment-Based Regulatory Region Analyses and Phylogenetic Shadowing

To find binding sites or motifs that are in common among sequences, alignment of putative regulatory regions is undertaken to identify islands of conservation, which may occur in a different order in the sequences compared. Computational approaches have been developed to identify alignable regions by finding small blocks of sequences without requiring regions to be in the same order or even orientation. Since identifying the general rules for evolution of regulatory elements is still an area of active research, it is not yet clear which heuristics are best, but it does appear that the main regulatory units are blocks of conserved sequences that can be shuffled in order and arrangement. The full extent of the relationship between binding site configuration and gene expression remains an enigma, so “rules” for optimizing alignments of regulatory regions have yet to be established.

A further extension of the alignment-based approach can be applied to pick out the more slowly evolving regions likely to be functional elements. Phylogenetic shadowing scores the rates of change on a per nucleotide basis so that regions that are evolving slower or faster than the neutral rate can be identified. This analysis ideally starts with a pairwise or multiple alignment from several species of varying phylogenetic distance. The choice of species will influence the depth of conservation that can be identified. The boundaries of the regulatory elements are found by identifying stretches of sequences where the substitution rate, or average number of changes, is similar. This allows for the identification of boundaries of conserved regions based on the rates of change rather than only the best-scoring alignment blocks, providing more nuanced determination of the likely regulatory regions.
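The per-nucleotide scoring idea can be illustrated with a much simpler stand-in for the phylogenetic rate models used in practice. The sketch below scores each alignment column by the fraction of species carrying the most common base, then reports stretches where a sliding-window average stays at or above a threshold. The scoring scheme, function names, window size, and threshold are all assumptions made for illustration; real shadowing methods weight changes by the phylogeny and fit explicit substitution rates.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column score: fraction of sequences that carry the most common base.

    alignment: list of equal-length aligned strings, with '-' for gaps.
    Gaps count against conservation because the divisor is the number
    of sequences, not the number of ungapped bases.
    """
    scores = []
    for column in zip(*alignment):
        bases = [b for b in column if b != "-"]
        if not bases:
            scores.append(0.0)
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, window=5, threshold=0.9):
    """Merged half-open (start, end) intervals whose window mean meets the threshold."""
    regions = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or abuts the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Because the boundaries come from where the smoothed score changes, the same alignment yields different intervals under different window sizes, which mirrors the point above that the choice of species and parameters shapes what conservation can be detected.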

Motif Enrichment and Identification

Binding site motifs and regulatory regions can also be found by searching regulatory regions for short “words” that are overrepresented in genome sequences. Presuming that the binding site sequences are a shared feature of most gene regions regulated by a particular transcription factor, logically there should be one or many copies of the sequence within regulatory regions of genes that have a common expression pattern. Motif-searching tools use this assumption and compare the overrepresentation of a sequence motif of some specified size range (typically 6–12 base pairs long) in the simplest fashion; that is, by counting all the observed motifs and comparing the distribution of observed counts with a theoretical background distribution that takes into account overall sequence bias (such as G + C). An easy way to generate this null background distribution is to also sample motifs from a set of sequences that serve as a negative control group of nonregulatory sequences or, less ideally, simply from a collection of random sequences. Additional modern approaches that use information theory and phylogenetic relatedness can improve the accuracy of identified motifs. Experimental validation of the accuracy of these approaches to identifying truly functional elements is still ongoing research, but it appears that enrichment and phylogenetic conservation are very often strong indicators of true positive functional binding sites.
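The counting approach described above can be sketched as follows, using a negative control set of sequences to supply the background. The function names, the pseudocount, and the choice to rank by a simple frequency ratio are assumptions for illustration; a real analysis would attach a significance test to each word and would usually allow degenerate motifs rather than exact matches.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping word of length k across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(foreground, background, k=6, pseudocount=1.0, top=5):
    """Rank words by foreground frequency relative to background frequency.

    foreground: regulatory regions of genes with a shared expression pattern.
    background: negative control sequences (assumed nonregulatory).
    The pseudocount avoids division by zero for words absent from the background.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    vocabulary = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + pseudocount * vocabulary
    bg_total = sum(bg.values()) + pseudocount * vocabulary

    scored = []
    for kmer in fg:
        fg_freq = (fg[kmer] + pseudocount) / fg_total
        bg_freq = (bg[kmer] + pseudocount) / bg_total
        scored.append((fg_freq / bg_freq, kmer))
    # Highest ratio first; ties are broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(kmer, ratio) for ratio, kmer in scored[:top]]
```

Using real nonregulatory sequence as the background captures compositional bias such as G + C content automatically, which is the reason the text prefers it over purely random sequence.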

A twist on the motif enrichment approach is also to search for motif avoidance, where a functional element, if present, would negatively impact the fitness of the organism. For example, if binding sites for transcription factors were found within the protein-coding region of a gene, it would be deleterious as they might allow a factor to bind and prevent efficient transcription by RNA polymerase. As a consequence, there should be a paucity of binding sites in the wrong place. Studies of bacterial genomes provide support for this approach: when searching for the RNA polymerase binding sites, researchers found these motifs significantly underrepresented in the protein-coding regions compared with the upstream or downstream regions of a gene. This approach may additionally validate potential elements by determining whether the occurrence is restricted in some regions of the genome, indicating that the site has functional activity that must be limited to specific sequence contexts.

4. COPY NUMBER VARIATION

The genome is not a static entity; as such, there is not one version of a genome sequence that completely represents a species. Variation among individuals can be in the form of differing nucleotides, insertions, or deletions, and in the total number of copies of genomic regions. Copy number variation (CNV) can increase or decrease copy number of all or part of a chromosome. In humans, CNV has been linked to disease or abnormal phenotypes such as autism and cancer; however, a great deal of variation can be detected that is not associated with disease. Several studies in humans and other species have indicated that CNV can be adaptive. For example, in the diploid yeast species Candida albicans, CNV is often observed in response to adaptation to stresses like antifungal drugs; both losses and duplications of parts of chromosomes that contain a drug target or important drug pumps have been observed.

CNV in an individual can be detected with comparative genome hybridization using microarrays constructed with probes designed to be regularly spaced across each chromosome. These arrays can detect copy number by scoring the relative intensity of hybridization of a region in one sample as compared to a control. Because the probes are spaced evenly across a chromosome, the boundaries of the CNV can be mapped with relatively high precision. CNV can also be detected with next-generation sequencing by aligning the short reads back to a genome sequence assembly and examining the depth of the reads. Regions with CNV will have higher or lower average coverage than the rest of the genome. As technologies continue to improve, we will gain a deeper understanding of the extent of CNV between individuals, populations, and species as well as its evolutionary implications.
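The read-depth approach can be sketched with a per-base coverage list, such as might be derived from aligned reads. The function name, window size, and gain and loss ratio cutoffs below are illustrative assumptions, and the sketch ignores corrections for G + C content and mappability that depth-based callers normally apply.

```python
from statistics import median


def call_cnv_windows(depths, window=1000, gain=1.5, loss=0.5):
    """Flag windows whose mean read depth departs from the typical depth.

    depths: per-base read depth along one chromosome (list of numbers).
    Returns (start, end, ratio, label) tuples with half-open coordinates.
    """
    means = [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]
    if not means:
        raise ValueError("sequence is shorter than one window")
    baseline = median(means)  # the median resists distortion by the CNV regions
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")

    calls = []
    for index, mean_depth in enumerate(means):
        ratio = mean_depth / baseline
        if ratio >= gain:
            label = "gain"
        elif ratio <= loss:
            label = "loss"
        else:
            label = "normal"
        calls.append((index * window, (index + 1) * window, ratio, label))
    return calls
```

In a diploid genome a single-copy duplication would be expected near a ratio of 1.5 and a single-copy deletion near 0.5, which motivates the default cutoffs used here.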

5. RAPIDLY EVOLVING REGIONS

Comparison of rates of change across genomic regions can provide insight into their relative functional importance. As discussed previously, conserved regions evolving more slowly than expected likely encode functional elements needed for survival of the organism. The identification of regions evolving more quickly than the background rate can suggest changes that may be important in adaptation. Methods that search for these rapidly evolving regions of the genome identify where more changes than expected have accumulated in comparison with the background rate of evolution. These methods can be used to test for faster changes across the entire genome, but they are most often applied specifically to protein-coding regions. By identifying fixed sequence differences between species within a protein-coding gene, tests can be performed to compare the number of DNA substitutions that change an amino acid (i.e., nonsynonymous) to the number of silent mutations (i.e., synonymous). Silent mutations occur because of the redundancy of the genetic code, so that several different codons encode the same amino acid residue. These approaches compare the rates of synonymous and nonsynonymous changes in a gene region, where the ratio of these rates can indicate the strength and direction of selection (see chapter V.14). Excess nonsynonymous mutations indicate positive selection, while an excess of silent changes indicates negative selection. A ratio close to 1 indicates a region evolving neutrally and can be considered the background rate, as one would expect for inactivated transposable elements and pseudogenes.
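The comparison of nonsynonymous and synonymous changes can be illustrated with a simplified counting method in the spirit of Nei and Gojobori. The sketch below assumes two aligned, in-frame coding sequences and the standard genetic code. It makes three simplifications that should be kept in mind: codon pairs differing at more than one position are skipped, changes to stop codons are counted as nonsynonymous, and no correction for multiple hits is applied. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Expected number of synonymous sites (0 to 3) in one codon."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1 / 3  # one of three possible changes at this position
    return total


def pn_ps_ratio(seq_a, seq_b):
    """Return (pN, pS, pN/pS) for two aligned in-frame coding sequences."""
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gaps or ambiguous bases
        changed = sum(x != y for x, y in zip(codon_a, codon_b))
        if changed > 1:
            continue  # simplification: skip codons with multiple differences
        s = (synonymous_sites(codon_a) + synonymous_sites(codon_b)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if changed == 1:
            if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    p_s = syn_diffs / syn_sites if syn_sites else 0.0
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else 0.0
    ratio = p_n / p_s if p_s else None  # undefined without synonymous changes
    return p_n, p_s, ratio
```

Normalizing by the number of available sites matters because most possible point mutations in a codon are nonsynonymous, so raw counts alone would overstate the apparent excess of amino acid changes.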

Searching for rapidly evolving protein-coding regions can identify those genes that may be involved in adaptation. Examples of these types of functions include genes involved in host-pathogen interactions, in evolution of drug resistance, and in organisms’ changes in lifestyles, food sources, or ecologies. For example, a plant chitinase, which degrades the chitin in an attacking fungal pathogen, shows an excess of amino-acid–changing mutations, suggesting that it has been evolving under positive selection. The changes identified occur in the active site of the enzyme and are interpreted to have improved the plant’s ability to further recognize and degrade the fungal chitin in response to avoiding defense strategies deployed by the fungus. Many other studies have identified positive selection, for example, due to sexual competition in cell-surface-recognition genes in plant pollen grains and in lysin, a sperm-recognition receptor protein in marine invertebrates.

Population genetic tests can also be used to find changes occurring among or between populations by examining changes in allele frequencies rather than fixed differences between species (see chapter V.14). For example, population analyses of Plasmodium falciparum, the causal agent of malaria, identified recent positive selection in genes that had acquired mutations conferring resistance to antimalarial drugs. Studies of human populations living in higher elevations of the Tibetan plateau revealed alleles under positive selection for transcription factors necessary for the hypoxic, or low oxygen, response. Both these studies found a change in the frequency of alleles between two populations differing in resistance or altitude, respectively; these data suggest that the changes in allele frequencies were driven by natural selection.

Recent efforts have also identified nonprotein-coding genomic regions that represent RNAs having a functional role in gene regulation. Katherine Pollard and colleagues have identified fast-evolving regions in the human genome based on comparisons with a chimpanzee and a collection of other animal genomes. The study employed methods that estimate the rate of evolution of a sequence region on each branch of the phylogenetic species tree. The rates on each branch were compared with a likelihood ratio test to identify cases where the human branch evolved much faster than the background rate seen in the other species. The likelihood ratio is applied to eliminate cases where the region is rapidly evolving in general to make it possible to identify cases where only the human branch is faster. Using this approach, they found 49 “human-accelerated regions” (or HARs). For one of these regions, a noncoding RNA gene was identified and shown to be expressed in the brain. It has been proposed that this change may contribute to human intelligence; thus, the increased rate of change could be linked to increased fitness as intelligence developed.

6. ULTRACONSERVED ELEMENTS

In contrast to rapidly evolving regions, slowly evolving segments of the genome likely encode regions performing an essential function that permits few changes. Conveniently, these regions can be easy to identify, since they are stretches of DNA that have not changed substantially between organisms separated by increasing evolutionary distance. To find these, simply applying parameters like sequence alignment windows of at least 100 base pairs (bp) and at least 90 percent identity should reveal loci that have changed little since divergence if the comparisons are between species with an average identity of less than 90 percent. Using this approach, researchers in several groups found long stretches of DNA significantly conserved among humans and other animals. The lengths and degree of conservation were much greater than would be expected given the time since divergence of the species, indicating the regions were under strong selection. One study found 481 ultraconserved segments between mouse, human, and rat that were 100 percent identical across at least 200 bp, and more than 5000 that were at least 100 bp long. The finding of nearly unchanged sequence regions for millions of years suggests strong functional significance for ultraconserved elements, but in many cases they did not appear to encode protein and only partially overlapped exons and introns of coding genes.
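The strictest version of this search, runs of columns that are identical in every species, can be sketched directly from a multiple alignment. The function name is hypothetical, and the sketch assumes the genomes have already been aligned; producing that whole-genome alignment is the computationally demanding step and is not shown.

```python
def identical_runs(alignment, min_length=200):
    """Half-open (start, end) alignment intervals identical in all sequences.

    alignment: list of equal-length aligned strings, with '-' for gaps.
    A column counts as identical only if it has no gap and a single base.
    """
    runs = []
    start = None
    length = len(alignment[0])
    for i, column in enumerate(zip(*alignment)):
        same = "-" not in column and len(set(column)) == 1
        if same and start is None:
            start = i  # a new identical run begins here
        elif not same and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_length:
        runs.append((start, length))
    return runs
```

Setting min_length to 200 on a human, mouse, and rat alignment corresponds to the 100 percent identity criterion quoted above; relaxing the identity requirement to 90 percent would call for a windowed identity score instead of exact runs.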

What functional role do these regions play? Through experimental manipulation and investigation of binding sites from the ultraconserved regions, it has been shown that some are functional regulatory elements; however, deletion of many megabases of DNA containing ultraconserved elements in mice showed no obvious fitness impact, indicating that the importance of these regions, despite evidence that they are under strong negative selection, still remains an enigma. Future experimental work and functional testing of these regions is needed to reveal the larger picture of how and why ultraconserved elements persist in genomes.

7. THE FUTURE OF COMPARATIVE GENOMICS

Comparative genomics utilizes evolutionary theory and computational techniques to provide a powerful tool that allows us to make sense of genome sequences. These approaches provide the means to generate experimentally testable hypotheses about likely functional elements, including the presence of regulatory regions, rapidly evolving regions that may be important for adaptation, and slowly evolving regions that may encode important but often poorly understood functions. The ability to generate the sequence of whole genomes from nearly any organism is now within reach for most biologists, providing even more resources by which comparative genomics can shed light on the evolutionary history of organisms and their features.