
Chapter 6: Clusters and Repeats

Chapter Opener: © Martin Shields/Science Source.

6.1 Introduction

A set of genes descended by duplication and variation from a single ancestral gene is called a gene family. Its members can be clustered together or dispersed on different chromosomes (or a combination of both). Genome analysis to identify paralogous sequences shows that many genes belong to families; the 20,000 or so genes identified in the human genome fall into about 15,000 families, so the average gene has about 2 relatives in the genome. Gene families vary enormously in the degree of relatedness among members, from those consisting of multiple identical members to those for which the relationship is quite distant. Genes are usually related only by their exons, with introns having diverged (see the chapter titled The Interrupted Gene). Genes can also be related by only some of their exons, whereas others are unique.

Some members of the gene family can evolve to become pseudogenes. Pseudogenes (ψ) are defined by their possession of sequences that are related to those of the functional genes but that cannot be transcribed or translated into a functional polypeptide. (See the Genome Sequences and Evolution chapter for further discussion.)

Some pseudogenes have the same general structure as functional genes, with sequences corresponding to exons and introns in the usual locations. They might have been rendered inactive by mutations that prevent any or all of the stages of gene expression. The changes can take the form of abolishing the signals for initiating transcription, preventing splicing at the exon–intron junctions, or prematurely terminating translation.

The initial event that allows the formation of related exons or genes is a duplication, when a copy of some sequence is generated within the genome. Tandem duplication (when the duplicates are in adjacent positions) can arise through errors in replication or recombination. Separation of the duplicates can occur by a translocation that transfers material from one chromosome to another. A duplicate at a new location might also be produced directly by a transposition event that is associated with copying a region of DNA from the vicinity of a transposable element. Duplications of intact genes, collections of exons, or even individual exons can occur. When an intact gene is involved, duplication generates two copies of a gene whose activities are initially indistinguishable, but then the copies usually diverge as each accumulates different substitutions.

The members of a structural gene family usually have related or even identical functions, although they might be expressed at different times or in different cell types. For example, different human globin proteins are expressed in embryonic and adult red blood cells, whereas different actins are utilized in muscle and nonmuscle cells. When genes have diverged significantly or when only some exons are related, their products can have different functions.

Some gene families consist of identical members. Clustering is a prerequisite for maintaining identity between genes, although clustered genes are not necessarily identical. Gene clusters range from the extreme case in which a duplication has generated two adjacent related genes to cases in which hundreds of identical genes lie in a tandem array. Extensive tandem repetition of a gene can occur when the product is needed in unusually large amounts. Examples are the genes encoding rRNA or histone proteins. This creates a special situation with regard to the maintenance of identity and the effects of selective pressure.

Gene clusters offer us an opportunity to examine the forces involved in evolution of the genome over regions larger than single genes. Duplicated sequences, especially those that remain in the same vicinity, provide a means for further evolution by recombination. A population evolves by the classical homologous recombination illustrated in FIGURE 6.1 and FIGURE 6.2, in which an exact crossing-over occurs (see the Homologous and Site-Specific Recombination chapter). The recombinant chromosomes have the same organization as the parental chromosome; they contain precisely the same loci in the same order but include different combinations of alleles, providing the raw material for natural selection. However, the existence of duplicated sequences allows aberrant events to occur occasionally, which changes the number of copies of genes and not just the combination of alleles.

FIGURE 6.1 Chiasma formation and crossing-over can result in the generation of recombinants.

FIGURE 6.2 Crossing-over and recombination involve pairing between complementary strands of the two parental duplex DNAs.

Unequal crossing-over (also known as nonreciprocal recombination) describes a recombination event occurring between two sites that are similar or identical but not precisely aligned. The feature that makes such events possible is the existence of repeated sequences. FIGURE 6.3 shows that this allows one copy of a repeat in one chromosome to misalign for recombination with a different copy of the repeat in the homologous chromosome instead of with the strictly homologous copy. When recombination occurs, it increases the number of repeats in one chromosome and decreases it in the other. In effect, one recombinant chromosome has a deletion and the other has an insertion. This mechanism is responsible for the evolution of clusters of related sequences. We can trace its operation in expanding or contracting the size of an array in both gene clusters and regions of highly repeated DNA.

FIGURE 6.3 Unequal crossing-over results from pairing between nonequivalent repeats in regions of DNA consisting of repeating units. Here, the repeating unit is the sequence ABC, and the third repeat of the light-blue chromosome has aligned with the first repeat of the dark-blue chromosome. Throughout the region of pairing, ABC units of one chromosome are aligned with ABC units of the other chromosome. Crossing-over generates chromosomes with 10 and 6 repeats each instead of the 8 repeats of each parent.
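The arithmetic in Figure 6.3 can be checked with a short sketch. It treats each chromosome as a list of ABC units; the function name `unequal_crossover` and its parameters `offset` and `point` are illustrative assumptions, not terms from the text.

```python
def unequal_crossover(chrom_a, chrom_b, offset, point):
    """Recombine two tandem arrays that are misaligned by `offset` repeats.

    Repeat (i + offset) of chrom_a is paired with repeat i of chrom_b.
    The exchange happens at the boundary in front of repeat `point` of
    chrom_b, which lies opposite boundary (point + offset) of chrom_a.
    """
    cut_a = point + offset
    if point < 0 or point > len(chrom_b) or cut_a > len(chrom_a):
        raise ValueError("crossover point lies outside the paired region")
    recombinant_long = chrom_a[:cut_a] + chrom_b[point:]
    recombinant_short = chrom_b[:point] + chrom_a[cut_a:]
    return recombinant_long, recombinant_short


# Two parental chromosomes with 8 ABC repeats each, as in Figure 6.3.
light_blue = ["ABC"] * 8
dark_blue = ["ABC"] * 8

# The third repeat of one chromosome pairs with the first repeat of the
# other, so the arrays are offset by two units.
long_array, short_array = unequal_crossover(light_blue, dark_blue, offset=2, point=3)
print(len(long_array), len(short_array))  # 10 6
```

In this model the position of the exchange within the paired region does not matter; only the offset between the two arrays determines how many repeats are gained and lost.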

The highly repetitive fraction of the genome consists of multiple tandem copies of very short repeating units. These often have unusual properties. One is that they might be identified as a separate peak on a density gradient analysis of DNA (see the Methods in Molecular Biology and Genetic Engineering chapter); this is the origin of the name satellite DNA, because the repetitive DNA forms a band that is separate from the main band in the gradient. They often are associated with heterochromatic regions of the chromosomes and in particular with centromeres (which contain the points of attachment for segregation on a mitotic or meiotic spindle). As a result of their repetitive organization, they show some of the same evolutionary patterns as the tandem gene clusters. In addition to the satellite sequences, there are shorter stretches of DNA called minisatellites, tandem repeats in which each repeat is between roughly 10 and 100 base pairs (bp) in length, and they have similar properties. They show a high degree of divergence between individual genomes, which makes them useful for mapping or identification purposes.

All of these events that change the constitution of the genome are rare, but they are significant over the course of evolution.

6.2 Unequal Crossing-Over Rearranges Gene Clusters

Over a sufficiently long period of time, there are many opportunities for rearrangement in a cluster of related or identical genes. We can see the results by comparing the mammalian β-globin clusters (see the Genome Sequences and Evolution chapter for discussion of the evolution of the globin gene family). Although all β-globin clusters serve the same function and have the same general organization, each is different in size, there is variation in the total number and types of β-globin genes, and the numbers and structures of pseudogenes are different. All of these changes must have occurred since the mammalian radiation approximately 85 million years ago (the time of the common ancestor to all the mammals).

The comparison makes the general point that gene duplication, rearrangement, and variation are as important factors in evolution as the slow accumulation of point mutations in individual genes (see the chapter titled Genome Sequences and Evolution). What types of mechanisms are responsible for gene reorganization?

As described in the introduction, unequal crossing-over can occur as the result of pairing between two sites that are homologous in sequence but not in position. Usually, recombination involves corresponding sequences of DNA held in exact alignment between the two homologous chromosomes. However, when there are two copies of a gene on each chromosome, an occasional misalignment allows pairing between the first copy on one chromosome and the second copy on the other. (This requires some of the adjacent regions to go unpaired.) This can happen in a region of short repeats or in a gene cluster. FIGURE 6.4 shows that unequal crossing-over in a gene cluster can have two consequences, quantitative and qualitative:

FIGURE 6.4 Gene number can be changed by unequal crossing-over. If gene 1 of one chromosome pairs with gene 2 of the other chromosome, the other gene copies are excluded from pairing. Recombination between the mispaired genes produces one chromosome with a single (recombinant) copy of the gene and one chromosome with three copies of the gene (one from each parent and one recombinant).

  • The number of repeats increases in one chromosome and decreases in the other. In effect, one recombinant chromosome has a deletion and the other has an insertion. This happens regardless of the exact location of the crossover. In the example in Figure 6.4, the first recombinant has an increase in the number of gene copies from two to three, whereas the second has a decrease from two to one.

  • If the recombination event occurs within a gene (as opposed to between genes), the result depends on whether the recombining genes are identical or only related. If the nonhomologous gene copies 1 and 2 are identical in sequence, there is no change in the sequence of either gene. However, unequal crossing-over can also occur when the sequences of adjacent genes are very similar (although the probability is less than when they are identical). In this case, each of the recombinant genes has a sequence that is different from either of the original sequences.

The determination of whether the chromosome has a selective advantage or disadvantage will depend on the consequence of any change in the sequence of the gene product as well as on the change in the number of gene copies.
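Both consequences can be illustrated with a minimal sketch of the event in Figure 6.4. It assumes that each parental chromosome carries two genes of equal length, that gene 2 of one chromosome mispairs with gene 1 of the other, and that the exchange occurs `k` bases into the paired genes. The function name and the toy sequences are invented for illustration.

```python
def unequal_crossover_in_cluster(gene_1, gene_2, k):
    """Return the three-copy and one-copy recombinant chromosomes.

    Both parents carry [gene_1, gene_2]. gene_2 of one chromosome is
    mispaired with gene_1 of the other, and the exchange occurs k bases
    into the paired genes.
    """
    if len(gene_1) != len(gene_2):
        raise ValueError("this sketch assumes genes of equal length")
    if not 0 <= k <= len(gene_1):
        raise ValueError("crossover point lies outside the gene")
    # One parental gene from each side flanks a recombinant gene.
    three_copy = [gene_1, gene_2[:k] + gene_1[k:], gene_2]
    # The reciprocal product keeps only a single recombinant gene.
    one_copy = [gene_1[:k] + gene_2[k:]]
    return three_copy, one_copy


# Identical copies: gene number changes, but no sequence changes.
identical = "ATGGCCAAGT"
three_same, one_same = unequal_crossover_in_cluster(identical, identical, k=6)

# Related copies (toy sequences differing at two positions): each
# recombinant gene differs from both of the original sequences.
gene_1 = "ATGGCCAAGT"
gene_2 = "ATGGTCAAGC"
three_rel, one_rel = unequal_crossover_in_cluster(gene_1, gene_2, k=6)
print(three_rel)  # ['ATGGCCAAGT', 'ATGGTCAAGT', 'ATGGTCAAGC']
print(one_rel)    # ['ATGGCCAAGC']
```

The change in copy number (three and one instead of two and two) is the same in both runs; only the run with related genes produces new sequences.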

An obstacle to unequal crossing-over is presented by the interrupted structure of the genes. In a case such as the globins, the corresponding exons of adjacent gene copies are likely to be similar enough to support pairing; however, the sequences of the introns have diverged appreciably. The restriction of pairing to the exons considerably reduces the continuous length of DNA that can be involved, lowering the chance of unequal crossing-over. So, divergence between introns could enhance the stability of gene clusters by hindering the occurrence of unequal crossing-over.

Thalassemias, inherited blood disorders resulting from abnormal hemoglobin, result from mutations that reduce or prevent synthesis of either α- or β-globin. The occurrence of unequal crossing-over in the human globin gene clusters is revealed by the nature of certain thalassemias. Many of the most severe thalassemias result from deletions of part of a cluster. In at least some cases, the ends of the deletion lie in regions that are homologous, which is exactly what would be expected if it had been generated by unequal crossing-over.

FIGURE 6.5 summarizes the deletions that cause the α-thalassemias. α-thal-1 deletions are long, varying in the location of the left end, with the positions of the right ends located beyond the known genes. They eliminate both of the α genes. The α-thal-2 deletions are short and eliminate only one of the two α genes. The L deletion removes 4.2 kilobases (kb) of DNA, including the α2 gene. It probably results from unequal crossing-over because the ends of the deletion lie in homologous regions, just to the right of the ψα and α2 genes, respectively. The R deletion results from the removal of exactly 3.7 kb of DNA, the precise distance between the α1 and α2 genes. It appears to have been generated by unequal crossing-over between the α1 and α2 genes themselves. This is precisely the situation depicted in Figure 6.4.

FIGURE 6.5 α-thalassemias result from various deletions in the α-globin gene cluster.

Depending on the diploid combination of thalassemic alleles, an affected individual can have any number of α genes from zero to three. There are few differences from the wild type (four α genes) in individuals with three or two α genes. However, if an individual has only one α gene, the excess β chains form the unusual tetramer β4, which causes hemoglobin H (HbH) disease. The complete absence of α genes results in hydrops fetalis, which is fatal at or before birth.
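The relationship between diploid genotype and outcome can be written as a small lookup, restating only what the text says. The chromosome labels and function names below are illustrative assumptions, and chromosomes carrying three α genes are deliberately left out of this sketch.

```python
# Functional alpha genes per chromosome, following the description of
# Figure 6.5 (labels are illustrative, not standard nomenclature).
ALPHA_GENES = {"normal": 2, "alpha-thal-2": 1, "alpha-thal-1": 0}


def alpha_gene_total(chromosome_1, chromosome_2):
    """Total number of alpha genes in a diploid individual."""
    return ALPHA_GENES[chromosome_1] + ALPHA_GENES[chromosome_2]


def expected_outcome(total):
    """Outcome described in the text for a given number of alpha genes."""
    if total == 4:
        return "wild type"
    if total in (2, 3):
        return "few differences from wild type"
    if total == 1:
        return "HbH disease"
    if total == 0:
        return "hydrops fetalis"
    raise ValueError("gene totals outside 0-4 are not covered by this sketch")


total = alpha_gene_total("alpha-thal-1", "alpha-thal-2")
print(total, expected_outcome(total))  # 1 HbH disease
```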

The same unequal crossing-over that generated the thalassemic chromosome should also have generated a chromosome with three α genes. Individuals with such chromosomes have been identified in several populations. In some populations, the frequency of the triple α locus is about the same as that of the single α locus; in others, the triple α genes are much less common than single α genes. This suggests that (unknown) selective factors operate in different populations to adjust the gene numbers.

Variations in the number of α genes are found relatively frequently, which suggests that unequal crossing-over in the cluster must be fairly common. It occurs more often in the α cluster than in the β cluster, possibly because the introns in α genes are much shorter and therefore present less of an impediment to mispairing between nonhomologous loci.

The deletions that cause β-thalassemias are summarized in FIGURE 6.6. In some (rare) cases, only the β gene is affected. These have a deletion of 600 bp, extending from the second intron through the 3′ flanking regions. In the other cases, more than one gene of the cluster is affected. Many of the deletions are very long, extending from the 5′ end indicated on the map for more than 50 kb toward the right.

FIGURE 6.6 Deletions in the β-globin gene cluster cause several types of thalassemia.

The Hb Lepore type provides the classic evidence that deletion can result from unequal crossing-over between linked genes. The β and δ genes differ by roughly 7% in sequence. Unequal crossing-over deletes the material between the genes, thus fusing them together (see Figure 6.4). The fused gene produces a single β-like chain that consists of the N-terminal sequence of δ joined to the C-terminal sequence of β.

Several types of Hb Lepore are known, with the difference between them lying in the point of transition from δ to β sequences. Thus, when the δ and β genes pair for unequal crossing-over, the exact point of recombination determines the position at which the switch from δ to β sequence occurs in the amino acid chain.

The reciprocal of this event has been found in the form of Hb anti-Lepore, which is produced by a gene that has the N-terminal part of β and the C-terminal part of δ. The fusion gene lies between normal δ and β genes. Although heterozygotes for this mutation are phenotypically normal, those that also carry a β deletion in trans show a mild β-thalassemia.

Evidence that unequal crossing-over can occur between more distantly related genes is provided by the identification of Hb Kenya, another fused hemoglobin. This contains the N-terminal sequence of the Aγ gene and the C-terminal sequence of the β gene. The fusion must have resulted from unequal crossing-over between Aγ and β, which differ by about 20% in sequence.

From the differences between the globin gene clusters of various mammals, we see that duplication (usually followed by diversification) has been an important feature in the evolution of each cluster. The human thalassemic deletions demonstrate that unequal crossing-over continues to occur in both globin gene clusters. Each such event generates a duplication as well as a deletion, and researchers must account for the fate of both recombinant loci in the population. Deletions can also occur (in principle) by recombination between homologous sequences lying on the same chromosome. This does not generate a corresponding duplication.

It is difficult to estimate the natural frequency of these events because evolutionary forces rapidly adjust the frequencies of the variant clusters in the population. Generally, a contraction in gene number is likely to be deleterious and selected against. However, in some populations, there might be a balancing advantage that maintains the deleted form at a low frequency. In particular, it might be that both homozygous and heterozygous carriers of a thalassemia deletion show resistance to certain infectious diseases, such as malaria. The form of balancing selection that can maintain such a mutation at a higher incidence is that heterozygotes might not show severe symptoms of thalassemia but benefit from the infectious disease resistance; because both normal and mutant alleles are carried by the heterozygote, selection maintains a “balance” of both alleles. Also, in small populations, genetic drift is likely to play a role in eliminating effectively neutral new duplications; in this mechanism, rare alleles are eliminated from the population by chance events. The heterozygote again might not show symptoms, but if heterozygotes are rare in a population, they might either fail to reproduce or happen to not pass along the mutant allele, so the allele is lost from the population.

The structures of the present human clusters show several duplications that attest to the importance of such mechanisms. The functional sequences include two α genes encoding the same polypeptide, fairly similar β and δ genes, and two almost identical γ genes. These comparatively recent independent duplications have persisted in the species, not to mention the more ancient duplications that originally generated the various types of globin genes. Other duplications might have given rise to pseudogenes or have been lost. We expect ongoing duplication and deletion to be a feature of all gene clusters.

6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit

In the case of the globin genes discussed earlier, there are differences between the individual members of the cluster that allow selective pressure to act somewhat differently (but because of linkage, not independently) upon each gene. A contrast is provided by two cases of large gene clusters that contain many identical copies of the same gene or genes. Most eukaryotic organisms contain multiple copies of the genes for the histone proteins that are a major component of the chromosomes, and in most organismal genomes there are multiple copies of the genes that encode the ribosomal RNAs. These situations pose some interesting evolutionary questions.

Ribosomal RNA is the predominant product of transcription, constituting some 80% to 90% of the total mass of cellular RNA in both eukaryotes and prokaryotes. The number of major rRNA genes varies from 1 (in Coxiella burnetii, an obligate intracellular bacterium, and in Mycoplasma pneumoniae), to 7 in Escherichia coli, to 100 to 200 in unicellular/oligocellular eukaryotes, to several hundred in multicellular eukaryotes. The genes for the large and small rRNAs (found in the large and small subunits of the ribosome, respectively) usually form a tandem pair. (The sole exception is the yeast mitochondrion.)

The lack of any detectable variation in the sequences of the rRNA molecules implies that all of the copies of each gene must be identical. A point of major interest is what mechanism(s) are used to prevent variations from accumulating in the individual sequences.

In bacteria, the multiple rRNA genes are dispersed. In most eukaryotic genomes, the rRNA genes are contained in a tandem cluster or clusters. Sometimes these regions are called rDNA. (In some cases, the proportion of rDNA in the total DNA, together with its atypical base composition, is great enough to allow its isolation as a separate fraction directly from sheared genomic DNA.) An important diagnostic feature of a tandem cluster is that it generates a circular restriction map (see the Methods in Molecular Biology and Genetic Engineering chapter for a description of restriction mapping), as shown in FIGURE 6.7.

FIGURE 6.7 A tandem gene cluster has an alternation of transcription unit and nontranscribed spacer and generates a circular restriction map.

Suppose that each repeat unit has three restriction sites, so that digestion releases three internal fragments (A, B, and C) from each repeat. When we map these fragments by conventional means, we find that A is next to B, which is next to C, which is next to A, generating the circular map. If the cluster is large, the internal fragments (A, B, and C) will be present in much greater quantities than the terminal fragments (X and Y) that connect the cluster to adjacent DNA. In a cluster of 100 repeats, X and Y would be present at 1% of the level of A, B, and C. This can make it difficult to obtain the ends of a gene cluster for mapping purposes.
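A short sketch shows why the map closes into a circle and where the 1% figure comes from. It uses an idealized digest in which the fragments lie in the order X, then A, B, C for each repeat, then Y; the function name `digest_cluster` is hypothetical.

```python
from collections import Counter


def digest_cluster(n_repeats):
    """Ordered fragments from an idealized tandem cluster.

    Assumes each repeat yields fragments A, B, and C, and that X and Y
    are the terminal fragments joining the cluster to adjacent DNA.
    """
    return ["X"] + ["A", "B", "C"] * n_repeats + ["Y"]


fragments = digest_cluster(100)

# Abundance of each fragment type.
counts = Counter(fragments)
print(counts["X"] / counts["A"])  # 0.01

# Which fragment types are found next to each other.
neighbours = Counter(zip(fragments, fragments[1:]))
print(neighbours[("A", "B")], neighbours[("B", "C")], neighbours[("C", "A")])  # 100 100 99
print(neighbours[("X", "A")], neighbours[("C", "Y")])  # 1 1
```

Because C is followed by A almost as often as A is followed by B, conventional mapping places the three fragments in a circle, and the single X and Y junctions are easily missed.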

The region of the nucleus where 18S and 28S rRNA synthesis occurs has a characteristic appearance, with a fibrillar core surrounded by a granular cortex. The fibrillar core is where the rRNA is transcribed from the DNA template, and the granular cortex is formed by the ribonucleoprotein particles into which the rRNA is assembled. The entire area is called the nucleolus. Its characteristic morphology is evident in FIGURE 6.8.

FIGURE 6.8 The nucleolar core identifies rDNA undergoing transcription and the surrounding granular cortex consists of assembling ribosomal subunits. This thin section shows the nucleolus of the newt Notophthalmus viridescens.

Photo courtesy of Oscar Miller.

The particular chromosomal regions associated with a nucleolus are called nucleolar organizers. Each nucleolar organizer corresponds to a cluster of tandemly repeated 18/28S rRNA genes on one chromosome. The concentration of the tandemly repeated rRNA genes, together with their very intensive transcription, is responsible for creating the characteristic morphology of the nucleoli.

The pair of major rRNAs is transcribed as a single precursor in both bacteria (where 5S and 16/23S rRNAs are cotranscribed) and the eukaryotic nucleolus (where the 18S and 28S rRNAs are transcribed). In eukaryotes, 5S genes are also typically found in tandem clusters transcribed as a precursor with transcribed spacers. Following transcription, the precursor is cleaved to release the individual rRNA molecules. The transcription unit is shortest in bacteria and is longest in mammals (where it is known as 45S RNA, according to its rate of sedimentation). An rDNA cluster contains many transcription units, each separated from the next by a nontranscribed spacer. Many RNA polymerases are simultaneously engaged in transcription on each repeating unit. The polymerases are so closely packed that the RNA transcripts form a characteristic matrix displaying increasing length along the transcription unit.

The length of the nontranscribed spacer varies a great deal between and (sometimes) within species. In yeast there is a short nontranscribed spacer that is relatively constant in length. In the fruit fly Drosophila melanogaster there is nearly twofold variation in the length of the nontranscribed spacer between different copies of the repeating unit. A similar situation is seen in the amphibian Xenopus laevis. In each of these cases, all of the repeating units are present as a single tandem cluster on one particular chromosome. (In the example of D. melanogaster, this happens to be the sex chromosomes. The cluster on the X chromosome is larger than the one on the Y chromosome, so female flies have more copies of the rRNA genes than male flies do.)

In mammals the repeating unit is much larger, comprising the transcription unit of about 13 kb and a nontranscribed spacer of about 30 kb. Usually, the genes lie in several dispersed clusters; in the cases of humans and mice the clusters reside on five and six chromosomes, respectively. One interesting question is how the corrective mechanisms that presumably function within a single cluster to ensure that rRNA copies are identical are able to work when there are several clusters. Recent research suggests that selection might maintain a coordinated number of functional copies of genes among clusters on different chromosomes to ensure that dosages of different rRNA molecules (which must interact in forming a ribosome) remain approximately equal.

The variation in length of the nontranscribed spacer in a single gene cluster contrasts with the conservation of sequence of the transcription unit. In spite of this variation, the sequences of longer nontranscribed spacers remain homologous with those of the shorter nontranscribed spacers. This implies that each nontranscribed spacer is internally repetitious, so that the variation in length results from changes in the number of repeats of some subunit.

The general nature of the nontranscribed spacer is illustrated by the example of X. laevis (FIGURE 6.9). Regions that are fixed in length alternate with regions that vary in length. Each of the three repetitive regions comprises a variable number of repeats of a rather short sequence. One type of repetitious region has repeats of a 97-bp sequence; the other, which occurs in two locations, has a repeating unit found in two forms, 60 bp and 81 bp long. The variation in the number of repeating units in the repetitive regions accounts for the overall variation in spacer length. The repetitive regions are separated by shorter constant sequences called Bam islands. (This description takes its name from their isolation via the use of the BamHI restriction enzyme.) From this type of organization, we see that the cluster has evolved by duplications involving the promoter region.

FIGURE 6.9 The nontranscribed spacer of X. laevis rDNA has an internally repetitious structure that is responsible for its variation in length. The Bam islands are short, constant sequences that separate the repetitious regions.

We need to explain the lack of variation in the expressed copies of the repeated genes. One hypothesis would be that there is a quantitative demand for a certain number of “good” sequences. However, this would enable mutated sequences to accumulate up to a point at which their proportion of the cluster is great enough for selection to act against them. We can exclude this hypothesis because of the lack of such variation in the cluster.

The lack of variation implies that there is negative selection against individual variations. Another hypothesis would be that the entire cluster is regenerated periodically from one or a very few members. As a practical matter, any mechanism would need to involve regeneration every generation. We can exclude this hypothesis because a regenerated cluster would not show variation in the nontranscribed regions of the individual repeats.

We are left with a dilemma. Variation in the nontranscribed regions suggests that there is frequent unequal crossing-over. This will change the size of the cluster but will not otherwise change the properties of the individual repeats. So, how are mutations prevented from accumulating? The following section shows that continuous contraction and expansion of a cluster might provide a mechanism for homogenizing its copies.

6.4 Crossover Fixation Could Maintain Identical Repeats

Not all duplicated copies of genes become pseudogenes. How can selection prevent the accumulation of deleterious mutations?

The duplication of a gene is likely to result in an immediate relaxation of the selection pressure on the sequence of one of the two copies. Now that there are two identical copies, a change in the sequence of one will not deprive the organism of a functional product, because the original product can continue to be encoded by the other copy. Then, the selective pressure on the two genes is diffused until one of them mutates sufficiently away from its original function to refocus all the selective pressure on the other.

Immediately following a gene duplication, changes might accumulate more rapidly in one of the copies, eventually leading to a new function (or to its disuse in the form of a pseudogene). If a new function develops, the gene then evolves at the same, slower rate characteristic of the original function. Probably this is the sort of mechanism responsible for the separation of functions between embryonic and adult globin genes.

Yet, there are instances in which duplicated genes retain the same function, encoding identical or nearly identical products. Identical polypeptides are encoded by the two human α-globin genes, and there is only a single amino acid difference between the two γ-globin polypeptides. How does selection maintain their sequence identities?

The most obvious possibility is that the two genes do not actually have identical functions but instead differ in some (undetected) property, such as time or place of expression. Another possibility is that the need for two copies is quantitative because neither by itself produces a sufficient amount of product.

However, in more extreme cases of repetition, it is impossible to avoid the conclusion that no single copy of the gene is essential. When there are many copies of a gene, the immediate effects of mutation in any one copy must be very slight. The consequences of an individual mutation are diluted by the large number of copies of the gene that retain the wild-type sequence. Many mutant copies could accumulate before a lethal effect is generated.

Lethality becomes quantitative, a conclusion reinforced by the observation that half of the units of the rDNA cluster of X. laevis or D. melanogaster can be deleted without ill effect. So how are these units prevented from gradually accumulating deleterious mutations? What chance is there for the rare favorable mutation to display its advantages in the cluster?

The basic principle of hypotheses that explain the maintenance of identity among repeated copies is to suppose that nonallelic genes are continually regenerated from one of the copies of a preceding generation. In the simplest case of two identical genes, when a mutation occurs in one copy, either it is by chance eliminated (because the sequence of the other copy takes over) or it is spread to both duplicates. Spreading exposes a mutation to selection. The result is that the two genes evolve together as though only a single locus existed. This is called concerted evolution or coincidental evolution. It can be applied to a pair of identical genes or (with further assumptions) to a cluster containing many genes. For example, the tandemly repeated rRNA gene copies discussed extensively earlier in the chapter show concerted evolution. rDNA clusters tend to have identical copies within genomes of a wide variety of prokaryotic and eukaryotic organisms, while showing variation among different species.

One mechanism for this concerted evolution is that the sequences of the nonallelic genes are directly compared with one another and homogenized by enzymes that recognize any differences. This can be done by exchanging single strands between them to form a duplex in which one strand derives from one copy and one strand derives from the other copy. Any differences are revealed as improperly paired bases, which are recognized by enzymes able to excise and replace a base, so that only A-T and G-C pairs remain. This type of event is called gene conversion and is associated with genetic recombination. Researchers should be able to ascertain the scope of such events by comparing the sequences of duplicate genes. If these duplicate genes are subject to concerted evolution, we should not see the accumulation of synonymous substitutions (those that do not change the amino acid sequence; see the Genome Sequences and Evolution chapter) between them because the homogenization process applies to these as well as to the nonsynonymous substitutions (those that do change the amino acid sequence). We know that the extent of the maintenance mechanism need not extend beyond the gene itself because there are cases of duplicate genes whose flanking sequences are entirely different. Indeed, we might see abrupt boundaries that mark the ends of the sequences that were homogenized.
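The expected signature of gene conversion, homogenized sequence inside the converted tract and abrupt boundaries at its ends, can be sketched as follows. This is a deliberately simplified model in which the recipient copy simply takes on the donor sequence across the tract; the function names and toy sequences are assumptions for illustration, and the strand exchange and mismatch repair steps are not modeled.

```python
def gene_convert(recipient, donor, start, end):
    """Copy the donor sequence into the recipient between start and end."""
    if len(recipient) != len(donor):
        raise ValueError("this sketch assumes aligned copies of equal length")
    return recipient[:start] + donor[start:end] + recipient[end:]


def differences(seq_a, seq_b):
    """Positions at which two aligned sequences differ."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Two duplicate genes (positions 2-11) with unrelated flanking bases.
donor = "GGATGGCCAAGTCC"
recipient = "TTATGGTCAAGCAA"

before = differences(donor, recipient)
converted = gene_convert(recipient, donor, start=2, end=12)
after = differences(donor, converted)

print(before)  # [0, 1, 6, 11, 12, 13]
print(after)   # [0, 1, 12, 13]
```

After conversion, the differences within the gene have disappeared, regardless of whether they would have been synonymous or nonsynonymous, while the flanking differences remain and mark the boundaries of the homogenized region.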

We must remember that the existence of such mechanisms can invalidate the determination of the history of such genes via their divergence, because the divergence reflects only the time since the last homogenization/regeneration event, not the original duplication.

The crossover fixation model suggests that an entire cluster is subject to continual rearrangement by the mechanism of unequal crossing-over. Such events can explain the concerted evolution of multiple genes if unequal crossing-over causes all the copies to be physically regenerated from one copy.

Following the sort of event depicted in Figure 6.4, for example, the chromosome carrying a triple locus could suffer deletion of one of the genes. Of the two remaining genes, 1.5 represent the sequence of one of the original copies; only a half of the sequence of the other original copy has survived. Any mutation in the first region now exists in both genes and is subject to selection.

Tandem clustering provides frequent opportunities for “mispairing” of loci whose sequences are the same, but that lie in different positions in their clusters. By continually expanding and contracting the number of units via unequal crossing-over, it is possible for all the units in one cluster to be derived from a rather small proportion of those in an ancestral cluster. The variable lengths of the spacers are consistent with the idea that unequal crossing-over events take place in spacers that are internally mispaired. This can explain the homogeneity of the genes compared with the variability of the spacers. The genes are exposed to selection when individual repeating units are amplified within the cluster; however, the spacers are functionally irrelevant and can accumulate changes.

In a region of nonrepetitive DNA, recombination occurs between precisely matching points on the two homologous chromosomes, thus generating reciprocal recombinants. The basis for this precision is the ability of two duplex DNA sequences to align exactly. We know that unequal recombination can occur when there are multiple copies of genes whose exons are related, even though their flanking and intervening sequences might differ. This happens because of the mispairing between corresponding exons in nonallelic genes.

Imagine how much more frequently misalignment must occur in a tandem cluster of identical or nearly identical repeats. Except at the very ends of the cluster, the close relationship between successive repeats makes it impossible even to define the exactly corresponding repeats! This has two consequences: There is continual adjustment of the size of the cluster; and there is homogenization of the repeating unit.

Consider a sequence consisting of a repeating unit “ab” with ends “x” and “y.” If we write the two homologous chromosomes one above the other, the exact alignment between “allelic” sequences would be as follows:

  xababababababababababababababababy
  xababababababababababababababababy

It is likely, however, that any sequence ab in one chromosome could pair with any sequence ab in the other chromosome. In a misalignment such as

      xababababababababababababababababy
  xababababababababababababababababy

the region of pairing is no less stable than in the perfectly aligned pair, although it is shorter. Researchers do not know very much about how pairing is initiated prior to recombination, but very likely it begins between short, corresponding regions and then spreads. If it begins within highly repetitive satellite DNA, it is more likely than not to involve repeating units that do not have exactly corresponding locations in their clusters.

Now suppose that a recombination event occurs within the unevenly paired region. The recombinants will have different numbers of repeating units. In one case, the cluster has become longer; in the other, it has become shorter,

      xababababababababababababababababy
                    ×
  xababababababababababababababababy

  xababababababababababababababababababy
                    +
  xababababababababababababababy

where “×” indicates the site of the crossover.
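The exchange above can be reproduced with plain strings. The following is a minimal sketch, not part of the original text: the function name `unequal_crossover`, the two-unit misalignment, and the crossover column are illustrative assumptions.

```python
def unequal_crossover(upper, lower, offset, col):
    """Exchange two misaligned clusters and return both recombinants.

    `upper` is shifted `offset` characters to the right of `lower`.
    The crossover falls just after column `col`, counted along `lower`.
    """
    cut_lower = col + 1            # characters of `lower` left of the crossover
    cut_upper = col + 1 - offset   # characters of `upper` left of the crossover
    longer = lower[:cut_lower] + upper[cut_upper:]
    shorter = upper[:cut_upper] + lower[cut_lower:]
    return longer, shorter


cluster = "x" + "ab" * 16 + "y"    # 16 repeats flanked by unique ends
# Assumed misalignment: two repeat units (4 characters); the column is arbitrary.
longer, shorter = unequal_crossover(cluster, cluster, offset=4, col=16)
print(longer.count("ab"), shorter.count("ab"))   # 18 14
```

Any crossover column inside the paired region gives the same two lengths; only the size of the misalignment determines how many units are gained and lost.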

If this type of event is common, clusters of tandem repeats will undergo continual expansion and contraction. This can cause a particular repeating unit to spread through the cluster, as illustrated in FIGURE 6.10. Suppose that the cluster consists initially of a sequence abcde, where each letter represents a repeating unit. The different repeating units are related closely enough to one another to mispair for recombination. Then, by a series of unequal recombination events, the size of the repetitive region increases or decreases, and one unit spreads to replace all the others.

FIGURE 6.10 Unequal recombination allows one particular repeating unit to occupy the entire cluster. The numbers indicate the length of the repeating unit at each stage.

The crossover fixation model predicts that any sequence of DNA that is not under selective pressure will be taken over by a series of identical tandem repeats generated in this way. The critical assumption is that the process of crossover fixation is fairly rapid relative to mutation so that new mutations either are eliminated (their repeats are lost) or come to take over the entire cluster. In the case of the rDNA cluster, of course, a further factor is imposed by selection for a functional transcribed sequence.
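The spread of a single unit through a cluster can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model taken from the text: exchanges are treated as occurring between two identical copies of the cluster, only one recombinant is kept each time, no new mutations arise, and the size limits (`min_len`, `max_len`) and maximum misalignment (`max_shift`) are hypothetical parameters chosen to keep the example small.

```python
import random


def unequal_crossover(cluster, rng, max_shift=2):
    """Return one recombinant from an exchange between two misaligned copies."""
    n = len(cluster)
    shift = rng.randint(-max_shift, max_shift)   # misalignment in repeat units
    i = rng.randint(0, n)                        # break point in the first copy
    j = min(max(i + shift, 0), n)                # break point in the second copy
    # j > i deletes units i..j-1; j < i duplicates units j..i-1.
    return cluster[:i] + cluster[j:]


def simulate_fixation(start="abcde", min_len=3, max_len=12,
                      max_events=100_000, seed=0):
    """Apply unequal exchanges until every unit in the cluster is the same."""
    rng = random.Random(seed)
    cluster = list(start)
    for event in range(1, max_events + 1):
        candidate = unequal_crossover(cluster, rng)
        if min_len <= len(candidate) <= max_len:   # keep the cluster within bounds
            cluster = candidate
        if len(set(cluster)) == 1:                 # one unit has taken over
            return "".join(cluster), event
    return "".join(cluster), max_events


final_cluster, events = simulate_fixation()
print(final_cluster, events)
```

Which unit wins depends only on chance in this sketch, because no unit is favored. Adding mutation at a rate comparable to the exchange rate would prevent the cluster from ever becoming uniform, which is why the model assumes that fixation is fast relative to mutation.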

6.5 Satellite DNAs Often Lie in Heterochromatin

Repetitive DNA is characterized by its relatively rapid rate of renaturation. The component that renatures most rapidly in a eukaryotic genome is called highly repetitive DNA and consists of very short sequences repeated many times in tandem in large clusters. As a result of its short repeating unit, it is sometimes described as simple sequence DNA. This component is present in almost all multicellular eukaryotic genomes, but its overall amount is extremely variable. In mammalian genomes it is typically less than 10%, but in (for example) the fruit fly Drosophila virilis, it amounts to about 50%. In addition to the large clusters in which this type of sequence was originally discovered, there are smaller clusters interspersed with nonrepetitive DNA. It typically consists of short sequences that are repeated in identical or related copies in the genome.

In addition to simple sequence DNA, multicellular eukaryotes have complex satellites with longer repeat units, usually in heterochromatic (but sometimes in euchromatic) regions. For example, Drosophila species have the 1.688 g-cm−3 class of satellite DNA that consists of a 359-bp repeat unit. In humans, the α satellite family, found in centromeric regions, has a repeat unit length of 171 bp. The human β satellite family has 68-bp repeat units interspersed with a longer 3.3-kb repeat unit that includes pseudogenes.

The tandem repetition of a short sequence often has distinctive physical properties that researchers can use to isolate it. In some cases, the repetitive sequence has a base composition distinct from the genome average, which allows it to form a separate fraction by virtue of its distinct buoyant density. A fraction of this sort is called satellite DNA. The term satellite DNA is essentially synonymous with simple sequence DNA. Consistent with its simple sequence, this DNA might or might not be transcribed, but it is not translated. (In some species, there is evidence that short RNAs are required for heterochromatin formation, suggesting that there is transcription of sequences in heterochromatic regions of chromosomes, which contain satellite DNA; see the Regulatory RNA chapter.)

Tandemly repeated sequences are especially liable to undergo misalignments during chromosome pairing, and therefore the sizes of tandem clusters tend to be highly polymorphic, with wide variations between individuals. In fact, the smaller clusters of such sequences can be used to characterize individual genomes in the technique of “DNA profiling” (see the section Minisatellites Are Useful for DNA Profiling later in this chapter).

The buoyant density of a duplex DNA depends on its GC content according to the empirical formula:

ρ = 1.660 + 0.00098 (%GC) g-cm−3

Buoyant density is usually determined by centrifuging DNA through a density gradient of cesium chloride (CsCl). The DNA forms a band at the position corresponding to its own density. Fractions of DNA differing in GC content by more than 5% can usually be separated on a density gradient.

When eukaryotic DNA is centrifuged on a density gradient, two categories of DNA may be distinguished:

  • Most of the genome forms a continuum of fragments that appear as a rather broad peak centered on the buoyant density corresponding to the average GC content of the genome. This is called the main band.

  • Sometimes an additional, smaller peak (or peaks) is seen at a different value. This material is the satellite DNA.

Satellites are present in many eukaryotic genomes. They can be either heavier or lighter than the main band, but it is uncommon for them to represent more than 5% of the total DNA. A clear example is provided by mouse DNA, as shown in FIGURE 6.11. The graph is a quantitative scan of the bands formed when mouse DNA is centrifuged through a CsCl density gradient. The main band contains 92% of the genome and is centered on a buoyant density of 1.701 g-cm−3 (corresponding to its average GC content of 42%, typical for a mammal). The smaller peak represents 8% of the genome and has a distinct buoyant density of 1.690 g-cm−3. It contains the mouse satellite DNA, whose GC content (30%) is much lower than that of any other part of the genome.

FIGURE 6.11 Mouse DNA is separated into a main band and a satellite band by centrifugation through a density gradient of CsCl.
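The empirical formula can be checked against the mouse values quoted above. This is a minimal sketch: the function and constant names are illustrative, and the formula is only the linear approximation given in the text.

```python
INTERCEPT = 1.660   # buoyant density at 0% GC, in g/cm^3
SLOPE = 0.00098     # change in density per percent GC


def buoyant_density(gc_percent):
    """Predicted buoyant density (g/cm^3) for a given GC content in percent."""
    return INTERCEPT + SLOPE * gc_percent


def gc_from_density(rho):
    """GC content (percent) implied by a measured buoyant density."""
    return (rho - INTERCEPT) / SLOPE


print(round(buoyant_density(42), 3))      # main band at 42% GC: 1.701
print(round(gc_from_density(1.690), 1))   # satellite band at 1.690: 30.6
```

The second value is the GC content that the formula implies for the satellite band. The measured composition of a satellite can differ from such a prediction.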

The behavior of satellite DNA in density gradients is often anomalous. When the actual base composition of a satellite is determined, it is different from the prediction based on its buoyant density. The reason is that ρ is a function not just of base composition but also of the constitution in terms of nearest neighbor pairs. For simple sequences, these are likely to deviate from the random pairwise relationships needed to obey the equation for buoyant density. In addition, satellite DNA can be methylated, which changes its density.

Often, most of the highly repetitive DNA of a genome can be isolated in the form of satellites. When a highly repetitive DNA component does not separate as a satellite, on isolation its properties often prove to be similar to those of satellite DNA. That is to say, highly repetitive DNA consists of multiple tandem repeats with anomalous centrifugation. Material isolated in this manner is sometimes referred to as a cryptic satellite. Together the cryptic and apparent satellites usually account for all the large tandemly repeated blocks of highly repetitive DNA. When a genome has more than one type of highly repetitive DNA, each exists in its own satellite block (although sometimes different blocks are adjacent).

Where in the genome are the blocks of highly repetitive DNA located? An extension of nucleic acid hybridization techniques allows the location of satellite sequences to be directly determined in the chromosome complement. In the technique of in situ hybridization, the chromosomal DNA is denatured by treating cells that have been “squashed.” Next, a solution containing a labeled single-stranded DNA or RNA probe is added. The probe hybridizes with its complementary sequences in the denatured genome. Researchers can determine the location of the sites of hybridization by a technique to detect the label, such as autoradiography or fluorescence.

Satellite DNAs are found in regions of heterochromatin. Heterochromatin is the term used to describe regions of chromosomes that are permanently tightly coiled up and inert, in contrast with the euchromatin that represents the active component of the genome (see the Chromosomes chapter). Heterochromatin is commonly found at centromeres (the regions where the kinetochores are formed at mitosis and meiosis for controlling chromosome segregation). The centromeric location of satellite DNA suggests that it has some structural function in the chromosome. This function could be connected with the process of chromosome segregation.

FIGURE 6.12 shows an example of the localization of satellite DNA for the mouse chromosomal complement. In this case, one end of each chromosome is labeled because this is where the centromeres are located in Mus musculus (mouse) chromosomes.

FIGURE 6.12 Cytological hybridization shows that mouse satellite DNA is located at the centromeres.

Photo courtesy of Mary Lou Pardue and Joseph G. Gall, Carnegie Institution.

6.6 Arthropod Satellites Have Very Short Identical Repeats

In arthropods, as typified by insects and crustaceans, each satellite DNA appears to be rather homogeneous. Usually, a single, very short repeating unit accounts for more than 90% of the satellite. This makes it relatively straightforward to determine the sequence.

The fly D. virilis has three major satellites and a cryptic satellite; together they represent more than 40% of the genome. TABLE 6.1 summarizes the sequences of the satellites. The three major satellites have closely related sequences. A single base substitution is sufficient to generate either satellite II or III from the sequence of satellite I.

TABLE 6.1 Satellite DNAs of D. virilis are related. More than 95% of each satellite consists of a tandem repetition of the predominant sequence.

Satellite   Predominant Sequence   Total Length   Genome Proportion
I           ACAAACT                1.1 × 10^7     25%
            TGTTTGA
II          ATAAACT                3.6 × 10^6     8%
            TATTTGA
III         ACAAATT                3.6 × 10^6     8%
            TGTTTAA
Cryptic     AATATAG
            TTATATC

The satellite I sequence is present in other species of Drosophila related to D. virilis and so might have preceded speciation. The sequences of satellites II and III seem to be specific to D. virilis and so might have evolved from satellite I following speciation.

The main feature of these satellites is their very short repeating unit of only 7 bp. Similar satellites are found in other species. D. melanogaster has a variety of satellites, several of which have very short repeating units (5, 7, 10, or 12 bp). We can find comparable satellites in crustaceans.

The close sequence relationship found among the D. virilis satellites is not necessarily a feature of other genomes, for which the satellites might have unrelated sequences. Each satellite has arisen by a lateral amplification of a very short sequence. This sequence can represent a variant of a previously existing satellite (as in D. virilis), or it could have some other origin.

Satellites are continually generated and lost from genomes. This makes it difficult to ascertain evolutionary relationships, because a current satellite could have evolved from some previous satellite that has since been lost. The important feature of these satellites is that they represent very long stretches of DNA of very low sequence complexity, within which constancy of sequence can be maintained.

One feature of many of these satellites is a pronounced asymmetry in the orientation of base pairs on the two strands. In the example of the major D. virilis satellites shown in Table 6.1, one of the strands is much richer in T and G bases. This increases its buoyant density so that upon denaturation this heavy strand (H) can be separated from the complementary light strand (L). This can be useful in sequencing the satellite.
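Both the relationship among the three major satellites and the strand asymmetry can be read directly from the sequences in Table 6.1. The sketch below is illustrative only; the helper names are assumptions, and the G+T fraction is used simply as a rough indicator of which strand is the heavy one.

```python
SATELLITES = {"I": "ACAAACT", "II": "ATAAACT", "III": "ACAAATT"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def gt_fraction(seq):
    """Fraction of bases in the strand that are G or T."""
    return sum(base in "GT" for base in seq) / len(seq)


for name, seq in SATELLITES.items():
    partner = complement(seq)
    print(name, seq, partner,
          differences(SATELLITES["I"], seq),
          round(gt_fraction(seq), 2), round(gt_fraction(partner), 2))
```

Satellites II and III each differ from satellite I at one position, and in every case the lower strand carries most of the G and T bases.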

6.7 Mammalian Satellites Consist of Hierarchical Repeats

In mammals, as typified by various rodent species, the sequences comprising each satellite show appreciable divergence between tandem repeats. Researchers can recognize common short sequences by their preponderance among the oligonucleotide fragments produced by chemical or enzymatic treatment. However, the predominant short sequence usually accounts for only a small minority of the copies. The other short sequences are related to the predominant sequence by a variety of substitutions, deletions, and insertions.

However, a series of these variants of the short unit can constitute a longer repeating unit that is itself repeated in tandem with some variation. Thus, mammalian satellite DNAs consist of a hierarchy of repeating units that can be detected by reassociation analyses or restriction enzyme digestion.

When any satellite DNA is digested with an enzyme that has a recognition site in its repeating unit, one fragment will be obtained for every repeating unit in which the site occurs. In fact, when the DNA of a eukaryotic genome is digested with a restriction enzyme, most of it gives a general smear due to the random distribution of cleavage sites. However, satellite DNA generates sharp bands because a large number of fragments of identical or almost identical size are created by digestion at restriction sites that lie a regular distance apart.

Determining the sequence of satellite DNA can be difficult. For example, researchers can cut the region into fragments with restriction endonucleases and attempt to obtain a sequence directly. However, if there is appreciable divergence between individual repeating units, different nucleotides will be present at the same position in different repeats, so the sequencing gels will not clearly identify the sequence. If the divergence is not too great—say, within about 2%—it might be possible to determine an average repeating sequence.

Individual segments of the satellite can be inserted into plasmids for cloning. A difficulty is that the satellite sequences tend to be excised from the chimeric plasmid by recombination in the bacterial host. However, when the cloning succeeds it is possible to determine the sequence of the cloned segment unambiguously. Although this gives the actual sequence of a repeating unit or units, we would need to have many individual such sequences to reconstruct the type of divergence typical of the satellite as a whole.

Using either sequencing approach, the information we can gain is limited to the distance that can be analyzed on one set of sequence gels. The repetition of divergent tandem copies makes it difficult to reconstruct longer sequences by obtaining overlaps between individual restriction fragments.

The satellite DNA of the mouse M. musculus is digested by the enzyme EcoRII into a series of bands, including a predominant monomeric fragment of 234 bp. This sequence must be repeated with few variations throughout the 60% to 70% of the satellite that is digested into the monomeric band. Researchers can analyze this sequence in terms of its successively smaller constituent repeating units.

FIGURE 6.13 depicts the sequence in terms of two half-repeats. By writing the 234-bp sequence so that the first 117 bp are aligned with the second 117 bp, we see that the two halves are quite similar. They differ at 22 positions, corresponding to 19% divergence. This means that the current 234-bp repeating unit must have been generated at some time in the past by duplicating a 117-bp repeating unit, after which differences accumulated between the duplicates.

FIGURE 6.13 The repeating unit of mouse satellite DNA contains two half-repeats, which are aligned to show the identities (in blue).

Within the 117-bp unit we can recognize two further subunits. Each of these is a quarter-repeat relative to the whole satellite. The four quarter-repeats are aligned in FIGURE 6.14. The upper two lines represent the first half-repeat of Figure 6.13; the lower two lines represent the second half-repeat. We see that the divergence between the four quarter-repeats has increased to 23 out of 58 positions, or 40%. The first three quarter-repeats are somewhat more similar and a large proportion of the divergence is due to changes in the fourth quarter-repeat.

FIGURE 6.14 The alignment of quarter-repeats identifies homologies between the first and second half of each half-repeat. Positions that are the same in all four quarter-repeats are shown in green. Identities that extend only through three-quarters of the quarter-repeats are in black, with the divergent sequences in red.

Looking within the quarter-repeats, we find that each consists of two related subunits (one-eighth-repeats), shown as the α and β sequences in FIGURE 6.15. The α sequences all have an insertion of a C and the β sequences all have an insertion of a trinucleotide sequence relative to a common consensus sequence. This suggests that the quarter-repeat originated by the duplication of a sequence like the consensus sequence, after which changes occurred to generate the components we now see as α and β. Further changes then took place between tandemly repeated αβ sequences to generate the individual quarter- and half-repeats that exist today. Among the one-eighth-repeats, the present divergence is 19/31 = 61%.

FIGURE 6.15 The alignment of eighth-repeats shows that each quarter-repeat consists of an α and a β half. The consensus sequence gives the most common base at each position. The “ancestral” sequence shows a sequence very closely related to the consensus sequence, which could have been the predecessor to the α and β units. (The satellite sequence is continuous so that for the purposes of deducing the consensus sequence we can treat it as a circular permutation, as indicated by joining the last GAA triplet to the first 6 bp.)
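The divergence figures quoted for the successive levels of the hierarchy are simple ratios of differing positions to aligned positions. The short sketch below recomputes them; the function name is illustrative.

```python
def percent_divergence(differing, aligned):
    """Percentage of aligned positions at which the repeats differ."""
    return 100 * differing / aligned


levels = [
    ("half-repeats", 22, 117),
    ("quarter-repeats", 23, 58),
    ("eighth-repeats", 19, 31),
]
for name, differing, aligned in levels:
    print(f"{name}: {differing}/{aligned} = "
          f"{percent_divergence(differing, aligned):.0f}%")
```

The smaller the unit, the greater the divergence, as expected if the shorter units arose from earlier duplications and have had longer to accumulate changes.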

The consensus sequence is analyzed directly in FIGURE 6.16, which demonstrates that the current satellite sequence can be treated as derivatives of a 9-bp sequence. We can recognize three variants of this sequence in the satellite, as indicated at the bottom of the figure. If in one of the repeats we take the next most frequent base at two positions instead of the most frequent, we obtain three similar 9-bp sequences:

  1. G A A A A A C G T

  2. G A A A A A T G A

  3. G A A A A A A C T

The origin of the satellite could well lie in an amplification of one of these three nonamers (9-bp units). The overall consensus sequence of the present satellite is T, which is effectively an amalgam of the three 9-bp repeats.

FIGURE 6.16 The existence of an overall consensus sequence is shown by writing the satellite sequence as a 9-bp repeat.

The average sequence of the monomeric fragment of the mouse satellite DNA explains its properties. The longest repeating unit of 234 bp is identified by the restriction digestion. The unit of reassociation between single strands of denatured satellite DNA is probably the 117-bp half-repeat, because the 234-bp fragments can anneal both in register and in half-register (in the latter case, the first half-repeat of one strand renatures with the second half-repeat of the other).

So far, we have treated the present satellite as though it consisted of identical copies of the 234-bp repeating unit. Although this unit accounts for the majority of the satellite, variants of it also are present. Some of them are scattered randomly throughout the satellite, whereas others are clustered.

The existence of variants is implied by the description of the starting material for the sequence analysis as the “monomeric” fragment. When the satellite is digested by an enzyme that has one cleavage site in the 234-bp sequence, it also generates dimers, trimers, and tetramers relative to the 234-bp length. They arise when a repeating unit has lost the enzyme cleavage site as the result of mutation.

The monomeric 234-bp unit is generated when two adjacent repeats each have the recognition site. A dimer occurs when one unit has lost the site, a trimer is generated when two adjacent units have lost the site, and so on. With some restriction enzymes, most of the satellite is cleaved into a member of this repeating series, as shown in the example of FIGURE 6.17. The declining number of dimers, trimers, and so forth shows that there is a random distribution of the repeats in which the enzyme’s recognition site has been eliminated by mutation.

FIGURE 6.17 Digestion of mouse satellite DNA with the restriction enzyme EcoRII identifies a series of repeating units (1, 2, 3) that are multimers of 234 bp and also a minor series (½, 1½, 2½) that includes half-repeats (see accompanying text). The band at the far left is a fraction resistant to digestion.
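The declining ladder of monomers, dimers, and trimers is what random loss of sites would produce. The simulation below is a sketch under assumptions that are not in the text: each repeat keeps its recognition site independently with probability `p_site`, digestion is complete, and the value 0.8 is purely illustrative (it places roughly 64% of the DNA in the monomer band, within the 60% to 70% range quoted above for EcoRII).

```python
import random
from collections import Counter


def digest_sizes(n_units, p_site, seed=0):
    """Count fragments by size (in repeat units) after a complete digest.

    A fragment runs from one surviving recognition site to the next.
    """
    rng = random.Random(seed)
    sizes = Counter()
    units_since_cut = None            # None until the first site is reached
    for _ in range(n_units):
        has_site = rng.random() < p_site
        if units_since_cut is not None:
            units_since_cut += 1
        if has_site:
            if units_since_cut is not None:
                sizes[units_since_cut] += 1
            units_since_cut = 0
    return sizes


sizes = digest_sizes(n_units=100_000, p_site=0.8)
total_units = sum(k * n for k, n in sizes.items())
for k in range(1, 5):
    # size in repeat units, number of fragments, fraction of the DNA in that band
    print(k, sizes[k], round(k * sizes[k] / total_units, 3))
```

With independent site loss, each step up the ladder is rarer than the one before by a constant factor. Clustered site loss would instead produce an excess of long multimers.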

Other restriction enzymes show a different type of behavior with the satellite DNA. They continue to generate the same series of bands. However, they digest only a small proportion of the DNA, say 5% to 10%. This implies that a certain region of the satellite contains a concentration of the repeating units with this particular restriction site. Presumably the series of repeats in this domain all are derived from an ancestral variant that possessed this recognition site (although some members since have lost it by mutation).

A satellite DNA suffers unequal recombination. This has additional consequences when there is internal repetition in the repeating unit. Let us return to our cluster consisting of “ab” repeats. Suppose that the “a” and “b” components of the repeating unit are themselves sufficiently similar to allow them to pair. Then, the two clusters can align in half-register, with the “a” sequence of one aligned with the “b” sequence of the other. How frequently this occurs depends on the similarity between the two halves of the repeating unit. In mouse satellite DNA, reassociation between the denatured satellite DNA strands in vitro commonly occurs in the half-register.

When a recombination event occurs out of register, it changes the length of the repeating units that are involved in the reaction:

   xababababababababababababababababy
                    ×
  xababababababababababababababababy

  xababababababababaababababababababy
                    +
  xababababababababbabababababababy

In the upper recombinant cluster, an “ab” unit has been replaced by an “aab” unit. In the lower cluster, an “ab” unit has been replaced by a “b” unit.

This type of event explains a feature of the restriction digest of mouse satellite DNA. Figure 6.17 shows a fainter series of bands at lengths of 0.5, 1.5, 2.5, and 3.5 repeating units, in addition to the stronger integral length repeats. Suppose that in the preceding example, “ab” represents the 234-bp repeat of mouse satellite DNA, generated by cleavage at a site in the “b” segment. The “a” and “b” segments correspond to the 117-bp half-repeats.

Then, in the upper recombinant cluster, the “aab” unit generates a fragment of 1.5 times the usual repeating length. In the lower recombinant cluster, the “b” unit generates a fragment of half of the usual length. (The multiple fragments in the half-repeat series are generated in the same way as longer fragments in the integral series, when some repeating units have lost the restriction site by mutation.)
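The origin of the half-integral fragments can be traced with the same “ab” notation. In this sketch, which is illustrative rather than taken from the text, the two clusters are misaligned by one character (half a repeat), the crossover column is arbitrary, and the enzyme is assumed to cut once after every “b”.

```python
import re


def crossover(upper, lower, offset, col):
    """Reciprocal recombinants when `upper` lies `offset` characters to the right."""
    cut_lower, cut_upper = col + 1, col + 1 - offset
    return (lower[:cut_lower] + upper[cut_upper:],
            upper[:cut_upper] + lower[cut_lower:])


def fragment_sizes(cluster, unit_len=2):
    """Cut after every 'b' and report fragment sizes in repeat units."""
    body = cluster.strip("xy")                    # ignore the unique ends
    return [len(frag) / unit_len for frag in re.findall(r"a*b", body)]


cluster = "x" + "ab" * 16 + "y"
with_aab, with_b = crossover(cluster, cluster, offset=1, col=17)

print(with_aab, sorted(set(fragment_sizes(with_aab))))   # sizes 1.0 and 1.5
print(with_b, sorted(set(fragment_sizes(with_b))))       # sizes 0.5 and 1.0
```

Each recombinant yields a single odd-sized fragment among otherwise unit-length fragments, which is consistent with the half-integral series being much fainter than the integral series.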

Turning the argument around, the identification of the half-repeat series on the gel shows that the 234-bp repeating unit consists of two half-repeats closely related enough to pair sometimes for recombination. Also visible in Figure 6.17 are some rather faint bands corresponding to 0.25- and 0.75-spacings. These will be generated in the same way as the 0.5-spacings, when recombination occurs between clusters aligned in a quarter-register. The decreased relationship between quarter-repeats compared with half-repeats explains the reduction in frequency of the 0.25- and 0.75-bands compared with the 0.5-bands.

6.8 Minisatellites Are Useful for DNA Profiling

Sequences that resemble satellites (in that they consist of tandem repeats of a short unit) but that overall are much shorter—consisting of (for example) 5 to 50 repeats—are common in mammalian genomes. They were discovered by chance as fragments whose size is extremely variable in genomic libraries of human DNA. The variability is observed when a population contains fragments of many different sizes that represent the same genomic region; when individuals are examined, there is extensive polymorphism and many different alleles can be found.

Whether a repeat cluster is called a minisatellite or a microsatellite depends on both the length of the repeat unit and the number of repeats in the cluster. The name microsatellite is usually used when the length of the repeating unit is less than 10 bp; the number of repeats is smaller than that of minisatellites. The name minisatellite is used when the length of the repeating unit is roughly 10 to 100 bp and there is a greater number of repeats. However, the terminology is not precisely defined. These types of sequences are also called variable number tandem repeat (VNTR) regions. VNTRs used in human forensics are microsatellites that generally have fewer than 20 copies of a 2- to 6-bp repeat.

The cause of the variation between individual genomes at microsatellites or minisatellites is that individual alleles have different numbers of the repeating unit. For example, one minisatellite has a repeat length of 64 bp and is found in the population with the following approximate distribution:

7% 18 repeats
11% 16 repeats
43% 14 repeats
36% 13 repeats
4% 10 repeats
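One way to express how variable such a locus is: compute the chance that two alleles drawn at random from the population differ. The sketch below uses the distribution above; it assumes random mating, and the variable names are illustrative. The percentages are rescaled because, being rounded, they sum to 101.

```python
REPEAT_LENGTH_BP = 64
allele_percent = {18: 7, 16: 11, 14: 43, 13: 36, 10: 4}   # repeats -> % of alleles

total = sum(allele_percent.values())      # 101, because the figures are rounded
freqs = {n: pct / total for n, pct in allele_percent.items()}

# Probability that two randomly chosen alleles have different repeat numbers.
heterozygosity = 1 - sum(f * f for f in freqs.values())

for n in sorted(freqs, reverse=True):
    # repeat number, length of the repeat array in bp, allele frequency
    print(n, n * REPEAT_LENGTH_BP, round(freqs[n], 3))
print(round(heterozygosity, 2))           # 0.67
```

On these assumptions, about two-thirds of individuals would carry two distinguishable alleles at this one locus.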

The rate of genetic exchange at minisatellite sequences is high, about 10^−4 per kb of DNA. (The frequency of exchanges per actual locus is assumed to be proportional to the length of the minisatellite.) This rate is about 10 times greater than the rate of homologous recombination at meiosis for any random DNA sequence.

The high variability of minisatellites makes them especially useful for DNA profiling, because there is a high probability that individuals will vary in their alleles at such a locus. FIGURE 6.18 presents an example of mapping by minisatellites. This shows an extreme case in which two individuals are both heterozygous at a minisatellite locus, and in fact all four alleles are different. All progeny gain one allele from each parent in the usual way and it is possible to unambiguously determine the source of every allele in the progeny. In the terminology of human genetics, the meioses described in this figure are highly informative because of the variation between alleles.

FIGURE 6.18 Alleles can differ in the number of repeats at a minisatellite locus so that digestion on either side generates restriction fragments that differ in length. By using a minisatellite with alleles that differ between parents, the pattern of inheritance can be followed.

One family of minisatellites in the human genome shares a common “core” sequence. The core is a GC-rich sequence of 10 to 15 bp, showing an asymmetry of purine/pyrimidine distribution on the two strands. Each individual minisatellite has a variant of the core sequence, but about 1,000 minisatellites can be detected on a Southern blot (see the Methods in Molecular Biology and Genetic Engineering chapter) by a probe consisting of the core sequence.

Consider the situation shown in Figure 6.18 but multiplied many times by the existence of many such sequences. The effect of the variation at individual loci is to create a unique pattern for every individual. This makes it possible to unambiguously assign heredity between parents and progeny by showing that 50% of the bands in any individual are inherited from a particular parent. This is the basis of the technique known as DNA profiling.

Both microsatellites and minisatellites are unstable, although for different reasons. Microsatellites undergo intrastrand mispairing, when slippage during replication leads to expansion of the repeat, as shown in FIGURE 6.19. Systems that repair damage to DNA—in particular, those that recognize mismatched base pairs—are important in reversing such changes, as shown by a large increase in frequency when repair genes are inactivated (see the chapter titled Repair Systems). Mutations in repair systems are an important contributory factor in the development of cancer, so tumor cells often display variations in microsatellite sequences. Minisatellites undergo the same sort of unequal crossing-over between repeats that we have discussed for other repeating units. One telling case is that increased variation is associated with a recombination hotspot. The recombination event is not usually associated with recombination between flanking markers but has a complex form in which the new mutant allele gains information from both the sister chromatid and the other (homologous) chromosome.

FIGURE 6.19 Replication slippage occurs when the daughter strand slips back one repeating unit in pairing with the template strand. Each slippage event adds one repeating unit to the daughter strand. The extra repeats are extruded as a single-strand loop. Replication of this daughter strand in the next cycle generates a duplex DNA with an increased number of repeats.
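The slippage process in Figure 6.19 can be mimicked with a toy simulation. This is a minimal sketch, not a quantitative model: it assumes that each repeat unit copied has an independent probability slip_prob of triggering one slippage event, that every event adds exactly one unit, and that contraction never occurs. The function names and parameter values are hypothetical.

```python
import random


def replicate_with_slippage(repeats, slip_prob, rng):
    """Copy an array of `repeats` units; each slippage event adds one unit."""
    daughter = 0
    for _ in range(repeats):
        daughter += 1  # normal copying of one template unit
        if rng.random() < slip_prob:
            daughter += 1  # daughter strand slipped back and copied the unit again
    return daughter


def simulate(repeats, generations, slip_prob, seed=0):
    """Return the repeat number after each successive round of replication."""
    rng = random.Random(seed)
    history = [repeats]
    for _ in range(generations):
        repeats = replicate_with_slippage(repeats, slip_prob, rng)
        history.append(repeats)
    return history


# Hypothetical values: a low slip_prob stands in for cells with intact mismatch
# repair, a higher one for cells in which repair genes are inactivated.
repair_intact = simulate(repeats=20, generations=50, slip_prob=0.0005)
repair_defective = simulate(repeats=20, generations=50, slip_prob=0.01)
```

In this sketch the effect of mismatch repair is folded into slip_prob: a lower value corresponds to most slippage products being corrected before they are fixed, and raising it corresponds to the increased instability seen when repair genes are inactivated.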

It is not clear at what repeating length the cause of the variation shifts from replication slippage to unequal crossing-over.

Summary

  • Most genes belong to families, which are defined by the presence of similar sequences in the exons of individual members. Families evolve by the duplication of a gene (or genes), followed by divergence between the copies. Some copies suffer inactivating mutations and become pseudogenes that no longer have any function.

  • A tandem cluster consists of many copies of a repeating unit that includes the transcribed sequence(s) and a nontranscribed spacer(s). rRNA gene clusters encode only a single rRNA precursor. Maintenance of active genes in clusters depends on mechanisms such as gene conversion or unequal crossing-over, which cause mutations to spread through the cluster so that they become exposed to evolutionary forces such as selection.

  • Satellite DNA consists of very short sequences repeated many times in tandem. Its distinct centrifugation properties reflect its biased base composition. Satellite DNA is concentrated in centromeric heterochromatin, but its function (if any) is unknown. The individual repeating units of arthropod satellites are identical. Those of mammalian satellites are related and can be organized into a hierarchy reflecting the evolution of the satellite by the amplification and divergence of randomly chosen sequences.

  • Unequal crossing-over appears to have been a major determinant of satellite DNA organization. Crossover fixation explains the ability of variants to spread through a cluster.

  • Minisatellites and microsatellites consist of even shorter repeating sequences than satellites, generally less than 10 bp for microsatellites and roughly 10 to 100 bp for minisatellites, with a shorter cluster length than satellites have. The number of repeating units is usually 5 to 50. There is high variation in the repeat number between individual genomes. Microsatellite repeat number varies as the result of slippage during replication; the frequency is affected by systems that recognize and repair damage in DNA. Minisatellite repeat number varies as the result of recombination events. Researchers can use variations in repeat number to determine hereditary relationships by the technique known as DNA profiling.

References

6.2 Unequal Crossing-Over Rearranges Gene Clusters

Research
  1. Bailey, J. A., et al. (2002). Recent segmental duplications in the human genome. Science 297, 1003–1007.

6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit

Research
  1. Afseth, G., and Mallavia, L. P. (1997). Copy number of the 16S rRNA gene in Coxiella burnetii. Eur. J. Epidemiol. 13, 729–731.

  2. Gibbons, J. G., et al. (2015). Concerted copy number variation balances ribosomal DNA dosage in human and mouse genomes. Proc. Natl. Acad. Sci. USA 112, 2485–2490.

6.4 Crossover Fixation Could Maintain Identical Repeats

Research
  1. Charlesworth, B., et al. (1994). The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371, 215–220.

6.6 Arthropod Satellites Have Very Short Identical Repeats

Research
  1. Smith, C. D., et al. (2007). The release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 316, 1586–1591.

6.7 Mammalian Satellites Consist of Hierarchical Repeats

Review
  1. Waterston, R. H., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.

6.8 Minisatellites Are Useful for DNA Profiling

Review
  1. Weir, B. S., and Zheng, X. (2015). SNPs and SNVs in forensic science. Forensic Sci. Int. Genet. Suppl. Ser. 5, e267–e268.

Research
  1. Jeffreys, A. J., et al. (1985). Hypervariable minisatellite regions in human DNA. Nature 314, 67–73.

  2. Jeffreys, A. J., et al. (1988). Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature 332, 278–281.

  3. Jeffreys, A. J., et al. (1994). Complex gene conversion events in germline mutation at human minisatellites. Nat. Genet. 6, 136–145.

  4. Jeffreys, A. J., et al. (1998). High-resolution mapping of crossovers in human sperm defines a minisatellite-associated recombination hotspot. Mol. Cell 2, 267–273.

  5. Strand, M., et al. (1993). Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365, 274–276.