The Princeton Guide to Evolution

II.2

Phylogenetic Inference

Mark Holder

OUTLINE

1. Logical and statistical inference

2. The parsimony approach

3. Likelihood-based approaches

4. Distance-based approaches

5. Computational aspects of tree estimation

6. Statistical support for clades

7. Bayesian inference

A phylogeny describes the genealogical relationships between different species. In the early 1960s, many biologists were openly skeptical about the prospects of inferring reliable phylogenies. The last 50 years have produced a rich variety of statistical approaches for estimating evolutionary relationships and quantifying the degree of statistical support for different aspects of phylogenetic hypotheses. Current methods use powerful models of biological characters changing over evolutionary time to tease apart historical signals from similarities due to convergence. Today, phylogenetic inference is a routine part of many evolutionary studies, and phylogenies often provide a crucial framework for testing hypotheses.

GLOSSARY

Alignment. The process of adding gaps to DNA sequence data such that each column of the data matrix contains DNA bases that are homologous to each other (all derived from the same base in the common ancestor of the sequences). The aligned data matrix can be referred to as an alignment.

Character and Character State. In phylogenetic inference, a character refers to a comparable trait that can be studied in multiple species. Characters are hypothesized to be homologous (inherited from a common ancestral species). Character state refers to the specific form of the character observed in a species. For example, if the character is “number of limbs,” the character state for horses would be 4.

Clade. A subtree in a phylogeny. A group of species delimited by an ancestral species and all its descendants.

Data Pattern. The pattern of character states for a set of species. Two different characters that display the same pattern will contain the same phylogenetic signal.

Homoplasy. The creation of the same character state more than once over evolutionary history. Homoplasy can mislead phylogenetic inference because it results in a similarity that is not evidence of a close evolutionary relationship.

1. LOGICAL AND STATISTICAL INFERENCE

A phylogenetic tree is a representation of the ancestor-descendant relationships between different species (see chapter II.1). We can collect data from extinct and extant species, and we can directly observe the genealogical relationships of individuals within a population, but phylogenetic relationships must be inferred.

Inference procedures can be classified as logical versus statistical. Logical inference has an appealing property: if our input “premises” are correct and our inferential rule is valid, then our logical conclusion must be correct. As a result, it is tempting to try to cast phylogenetic inference problems into the realm of logical inference; for example, one possible inferential rule might be the following:

Rule #1: Any two species sharing a homologous attribute (a “character state”) must be more closely related to each other than either one is to a species not sharing this character state.

Homologous, in this context, refers to attributes in different organisms that are similar to each other because they were inherited from a common ancestor (see chapter II.7).

One could collect data and arrange them into a matrix in which various columns represented distinct, heritable traits (such as the number of digits on a forelimb), with each row corresponding to a different species. By examining this data matrix and repeatedly applying Rule #1, we could build up a phylogenetic tree piece by piece. Our rule allows us to learn about a piece of the tree by looking for shared character states.

Consider a character corresponding to the concept of “number of limbs” and rows for a human, a lizard, and a snake. Humans and lizards share a trait (presence of four limbs). We are reasonably certain that the most recent common ancestor of humans and lizards had four comparable limbs; thus, the attributes we are scoring are homologous character states (determining which similar attributes represent homologous character states is not a trivial problem; see chapter II.7). It appears we should be able to use our rule to infer that humans are more closely related to lizards than either is to snakes; unfortunately, this result is incorrect. Analysis of other characters (such as the structure of the male reproductive organs, the hemipenes) immediately leads to a conflicting conclusion that lizards and snakes are more closely related to each other than either are to mammals. Clearly, Rule #1 is too simplistic.

Willi Hennig, a German entomologist, dramatically clarified the logic of phylogenetic inference by demonstrating that Rule #1 cannot provide a firm foundation for reconstructing a phylogeny. Hennig pointed out that if we modify the rule to use only character states corresponding to evolutionary novelties (“apomorphies” in his terminology), we ought to arrive at correct inferences. In the context of the human/lizard/snake example, the novel character state is the lack of limbs in snakes. Presence of limbs is not a novelty, because the most recent common ancestor of all three groups had limbs. So, this similarity in homologous traits is not a reason to posit a close relationship between humans and lizards; however, the presence of evertible hemipenes in lizards and snakes is an evolutionary innovation (relative to the most recent common ancestor of humans, lizards, and snakes). Thus, Hennig’s approach groups snakes and lizards as more closely related to each other.

Yet even in this simple example we can see some difficulties. Under Hennig’s rule, we need a data matrix that makes statements about which character states are homologous and which are ancestral rather than derived (the “polarity” of characters). Before knowing the phylogenetic tree for a group, it is hard to imagine being certain of the attributes of an ancestral species. While we can use information from developmental biology, detailed structural analysis, paleontology, and comparisons to more distantly related organisms to inform polarity decisions, there is no way to avoid all mistakes. Indeed, when we look at real data sets, we almost invariably find conflict between characters. This would not happen if all our homology and polarity diagnoses were correct. When we are not 100 percent certain of our premises, logical inference cannot guarantee the correctness of conclusions. In fact, trying to conduct a logical analysis with conflicting premises will usually lead to no conclusion at all; instead, we must move to the realm of statistical inference.

In statistical estimation the observed data are used to determine the best-fitting value (or range of values) of an unknown quantity (a parameter). Statistical methods provide estimation procedures that account for the possibility of random errors, chance phenomena that might cause the observed data to be somewhat different than expected. For example, a statistical method would have to allow that flipping a fair coin could once in a while yield five consecutive heads. The value for the parameter that is considered optimal is referred to as the estimate of the parameter. The formula or procedure used to create the estimate is referred to as the estimator. In phylogenetic estimation, the tree and the lengths of its branches are the parameters of interest. In general, statistical estimators work by finding parameter values tending to produce data that is similar to the data that we have observed. This match between parameters and data can be assessed in a number of ways, leading to a wide variety of frameworks for developing estimators.

To move from a general description of statistical estimators to a concrete estimation procedure, we must describe the relationship between parameter and data in the case of phylogenetic inference. We must explicitly describe some sort of error model. Hennigian arguments predict that two closely related species will share derived character states that neither shares with more distantly related organisms. The prediction makes sense, because any novelties that evolve along a lineage should be inherited by the descendant lineages, and they should distinguish those descendant species from all other species; however, multiple changes in the same character can occur, and these will obscure the historical signal. For a full error model, we must also describe how data patterns other than “perfect Hennigian” characters can evolve.

An example using primate phylogeny. Consider an alignment of the entire mitochondrial genome sequence for a human, an orangutan, a baboon, and a squirrel monkey (a small portion of the alignment is shown in table 1). Based on other data, we are confident in the correct phylogeny for this group of species. A wide variety of morphological and biogeographic evidence support the hypothesis that, among these four species, human and orangutan are sister to each other, and the Old World primates (human, orangutan, and baboon) are more closely related to each other than they are to the New World squirrel monkey. It is possible for the topology of a gene tree for a specific locus to differ from the species tree for a variety of reasons, but for very good reasons (that we need not go into here) we can be confident that the true mitochondrial gene tree resembles the species phylogeny.

Table 1. Characters (also referred to as “sites”) 21–35 of an alignment of mitochondrial genomes of four species of primates

Despite our confidence in the genealogy of the mitochondrial genome before even examining the sequence data, we can see a wide variety of data patterns. In fact, ignoring sites with gaps that arise as a result of insertion or deletion events, there are 256 possible data patterns, derived from the number of nucleotides (4) raised to the number of species (4). Many sites will show the same nucleotide for all four species; they will appear as “constant” sites in the matrix (such as the last four sites in the matrix shown in table 1). Many other sites will show one DNA base in one of the species, and another base in the other three species. Because the squirrel monkey is the out-group (the most distantly related species) in this tree, its sequence has been on a distinct evolutionary trajectory for a longer period of time than those of the other species. Thus, we might expect more sites in which it differs from the other three species (e.g., site 23 in table 1). As the Hennigian logic suggests, we would expect to observe sites in which human and orangutan share a state with each other, while the squirrel monkey and baboon share a different nucleotide.

If positions in the mitochondrial genome behaved like “perfect” phylogenetic characters, then the same site would never change more than once, meaning that we could easily analyze the data using Hennigian logic to obtain the correct phylogeny. In reality, there is a substantial probability that some sites will experience multiple mutational events over the course of this evolutionary history (spanning more than 30 million years of evolution); indeed, some sites in the actual mitochondrial alignment display all four nucleotides, indicating that at least three changes must have occurred. Thus, despite the fact that we expect a relatively high proportion of sites (such as site 31 in table 1) that support the grouping of human with orangutan, we should expect some sites to show a conflicting signal. In this example, the total alignment length is 16,767 sites. Of these, 654 sites show data patterns that favor grouping human and orangutan, 320 favor grouping orangutan and baboon (e.g., site 22 in table 1), and 275 favor a tree that places human and baboon together (e.g., site 21 in table 1). From the perspective of logical analysis, conflicting signals in the data indicate that we cannot treat each character as an inerrant indicator of phylogenetic history. From the statistical perspective, we can still make progress if we can model the connection between various parameters (trees in this case) and the data. This modeling could be derived from a mechanistic understanding of the process by which different data patterns can arise on a tree, or the model could simply predict properties of the data without trying to derive them from first principles about the processes of character evolution.

2. THE PARSIMONY APPROACH

Using a different model of the evolution of characters leads to different estimation procedures. If we think that changes to any character are rare, then we expect to find very few examples of homoplasy (see glossary). If we try to find the tree that implies the least amount of homoplasy, we will be led to a parsimony criterion: we prefer the tree that requires the fewest changes in character state to explain the data. Calculating the smallest number of changes required to explain the data sounds daunting. The character states for the ancestral species are not observed, so we cannot simply count the number of changes that occur on each branch of the tree. Fortunately, very efficient algorithms developed in the 1970s can calculate the minimum number of steps required to explain the data (the parsimony score) in one sweep down the tree.

Computer simulation studies have been widely used to study the behavior of tree estimators. Because the true tree for a simulation is known, the accuracy of inference methods can be tested. Even though the parsimony procedure for estimating trees appears to rely on the rarity of changes, many simulation studies have found that parsimony can accurately estimate trees in which large numbers of character changes have occurred. Having a low number of changes per branch seems a more important predictor of when a parsimony estimate of the phylogeny will be reliable. If taxon sampling is very dense, parsimony may be able to accurately reconstruct trees even when the total amount of evolution is large.

Despite the fact that parsimony often yields reliable estimates, it has been shown to be sensitive to unequal branch lengths on the tree (as was pointed out by Felsenstein in 1978). Long branches on a tree can be caused by long periods of time between speciation events, or a high rate of character change, or a combination of these factors. Long branches can cause the conflicting signal in our data to be concentrated in specific misleading ways. If two branches on the tree experience a large number of character changes, the probability can be relatively high that they will converge on the same character state independently. Parsimony assigns a penalty of one step to each required change, regardless of where on the tree the change occurs; thus, parsimony will not account for the existence of long branches. When homoplasy is concentrated on a few branches of the true evolutionary tree, parsimony will often erroneously place some of the “long-branch” taxa together. The problem is particularly severe for character types displaying a very limited set of states (such as the four nucleotides in DNA sequence information). Felsenstein (1978) showed that there are cases in which parsimony will reconstruct the tree incorrectly, even if given an unlimited supply of data.

3. LIKELIHOOD-BASED APPROACHES

Developing estimators that account for unequal branch lengths requires us to treat the lengths of branches as parameters. Rather than just estimating the tree topology, we must introduce branch lengths into the estimation machinery. The true branch lengths are unknown, but we can assign them values most compatible with the data. The goal of maximum likelihood (ML) tree estimation procedure is to find the combination of tree topology and branch lengths that maximizes a likelihood score.

A fully specified model is a specific set of parameters—a point in “parameter space.” If the observed data are what would be expected to arise under a model, then that model fits the data well. The likelihood is a way to assess this fit. The probability that a model with a given set of parameter values would generate a data set identical to the observed data is referred to as the likelihood of the model. Note that the “likelihood of a tree” in the statistical sense is not the same as the “probability that a tree is correct.” In everyday usage, probability and likelihood are used synonymously, but in statistical inference the likelihood of a tree is a probability statement about the data if we assume that the tree is correct.

Given a particular data matrix, we can calculate the likelihood of any tree model: the probability that it would have given rise to exactly these data. If a model states that the observed data are impossible, then the model will have a likelihood of 0 and be rejected. In phylogenetic analyses, however, a tree model will never completely rule out any data set; nonetheless, some data sets have a very low probability of being generated under a particular tree. If such a data set is observed, then that tree is a poor estimate. In contrast, a tree with a high likelihood is a better estimate, and the tree that results in the maximum likelihood value among all trees is viewed as the best estimate of the tree.

To infer trees using ML, we must be able to assign probabilities to any data pattern that can occur. The probability statements for a data pattern can be constructed by considering the probability of all possible character state changes across a single branch in the tree (the transition probabilities for the branch). By assuming that evolutionary events on different branches are independent, one can combine per-branch transition probabilities into probabilities for the evolutionary history of a character across the entire tree. It is often convenient to consider the rates at which different changes could occur. Mathematical transformations allow us to extrapolate the effects of an evolutionary process occurring at a certain rate over any timescale of interest.

After a description of character evolution is formulated in terms of rates of change, it is possible to calculate a likelihood for any combination of tree topology and branch lengths. Felsenstein’s (1981) pruning algorithm makes the probability calculations feasible. However, the need to maximize the likelihood score over a large number of parameters still makes ML much slower than parsimony.

By maximizing the likelihood we can find the parameter values that match the data most closely. This provides estimates of the tree and of branch lengths. One implication of treating branch lengths as unknown parameters is that every character we observe in a data matrix provides information. An entirely constant character will not “directly” prefer one tree over another, but it will provide evidence that branch lengths are short. This could have an indirect effect on which tree best fits the data, so even constant characters can alter tree inference.

The inclusion of an explicit model of character evolution is both a strength of ML methods and a target of criticism. ML methods can use all the available data to pick up on fairly subtle patterns. ML is much more resistant to branch length inequality than parsimony; however, misspecification of the model can lead to incorrect tree inference. Our models of character evolution are dramatic oversimplifications of the real evolutionary processes. Fortunately, numerous computer simulations have demonstrated that ML tree estimation is fairly robust against violations in the details of the model of character evolution, as long as the dominant aspects of evolution (e.g., unequal branch lengths, different rates of evolution for different characters) are incorporated into the models.

4. DISTANCE-BASED APPROACHES

Calculating the likelihood of a tree involves considering the probabilities associated with a huge number of possible evolutionary scenarios that could have led to the observed data. Furthermore, ML inference must consider a huge parameter space of branch lengths and rates of character change (in addition to the space of all trees). Distance-based methods for tree reconstruction simplify the tree inference problem by trying to explain only the observed divergences between the tips of the tree.

A tree makes a prediction about the evolutionary distance (the number of character state changes that have occurred) between each pair of tips of the tree. We can observe pairwise divergence in the characters that we study, so we have an empirical estimate of the tip-to-tip distance. From the character matrix, we can calculate a divergence between each taxon to every other taxon and summarize these calculations in a taxon-by-taxon distance matrix. The combination of tree topology and branch lengths with tip-to-tip divergences closest to the observed distance matrix is judged to be the best estimate of phylogeny.

Distance-based approaches treat the distance matrix as if it were the only data relevant to tree inference. Because distance methods do not have to “map” evolutionary events on the tree for each character, they can be very fast. The price paid for this computational benefit is unclear. Condensing a character matrix into a distance matrix implies a loss of information. When we compare characters between two taxa (tips on a tree) the number of differences represents a minimum number of evolutionary events that must have occurred. This minimum will usually underestimate the actual number of events. Thus, the observed pairwise distance matrix is not an error-free representation of the evolutionary distance between tips. Models can be used to correct the pairwise distance estimates for repeated changes at the same position (the “multiple hits” problem). But even a corrected pairwise distance is often an imprecise estimate of the true number of evolutionary events. Relying on a summary of the data (the distance matrix) rather than the full data should make distance methods less powerful than character-based methods. In general, no compelling statistical reasons have been advanced for preferring distance-based methods over likelihood-based approaches. Distance-based approaches continue to be widely used, however, because they provide reasonable estimates of the tree very quickly even for very large data sets.

5. COMPUTATIONAL ASPECTS OF TREE ESTIMATION

The preceding sections have focused on the statistical basis of estimating a tree, specifically on the correspondence between various estimation methods and different ways of assessing the fit between a phylogenetic hypothesis and the observed data. Developing the computational machinery to conduct phylogenetic inference is a complementary, and very active, area of research. Whether we use an ML score, a parsimony score, or a distance-based score, we must still find the tree that produces the optimal score. The number of possible trees is enormous, so scoring every possible tree is not feasible. In general, software for phylogenetic estimation works by generating a rough initial solution, then trying to improve the estimate by looking at similar trees.

A procedure called stepwise addition builds up a tree estimate by adding taxa to a growing tree one at a time. At each step the attachment point with the best score for a taxon is chosen. The procedure is not guaranteed to produce the best-scoring tree. Placements made in early steps may be suboptimal when new data are added to the tree. The initial approximation of the tree obtained by stepwise addition can be perturbed by rearranging a few of the relationships while keeping most of the tree’s structure intact. If the perturbation results in a tree with a better score, we have improved our solution, and we can continue searching for a better tree. If we try a large number of perturbations and fail to find a tree with abetter score, we can terminate the search. The final tree will be a good approximation of the tree with the optimal score even if we cannot guarantee that our search found the best tree. Repeatedly performing searches from different starting points can reveal whether this type of hill-climbing approach appears to be working on a particular data set. If each starting point yields a different final tree, then the landscape of tree scores is very complex and there is a good chance that none of the searches identified the “global” optimum.

Many variations of this general strategy of tree searching have been studied, and it is now feasible to reconstruct trees of hundreds and even thousands of taxa. When dealing with large data sets, one can rarely be confident that the optimal tree has been found; however, it is unlikely that one could reconstruct a huge tree with no error at all. The crucial question becomes, “What aspects of the tree are strongly supported by the data?” (see below). Strongly supported branches in a tree are usually easy to find during tree searching, so most phylogenetic analyses are limited more by the amount of information in the data rather than by the efficiency of tree-searching software.

6. STATISTICAL SUPPORT FOR CLADES

With enough computational resources, we can be confident that a given phylogeny represents the best estimate of evolutionary relationships that can be obtained from our data. Explicit criteria can help us choose among a set of alternative families of models, and formal tests of model adequacy can identify cases in which our inferential models are clearly unrealistic. Even with a satisfactory model of the evolutionary processes generating the data, we still must acknowledge the possibility that limitations in our data can lead to an incorrect estimate. Our estimates are based on a finite sample of data.

Given a clade in our estimate of phylogeny, such as the grouping of human with orangutan, we would like to know whether the grouping could simply be the result of sampling error rather than a true evolutionary signal. Sampling error refers to the mistakes in estimation caused by a small sample of data. A common approach in statistics is to calculate a P value to evaluate the strength of evidence about a proposition. Roughly speaking, a P value is the probability of seeing at least as much evidence against a proposition even if the proposition is true. It helps us assess whether it is plausible to discount our result as merely an artifact of sampling error.

To calculate a P value for a clade within a phylogeny, we quantify the support for the group in a numerical statistic. The most appealing choice is the difference in score between the best-scoring tree (our estimate) and the best tree that does not contain the clade of interest. For example, in the primate mitochondrial genome example, the tree with the best parsimony score grouped human and orangutan; this tree required 7990 changes to explain the data. The best alternative tree places orangutan with the baboon; that tree required 8324 steps to explain the sequence data; thus, we can say that the human + orangutan tree is 334 changes better than the next-best tree. Obtaining a large difference in scores clearly implies that we have more compelling evidence. This same style of analysis could have been conducted using ML scores or distance-based approaches to scoring trees.

We might want to calculate a P value for the hypothesis that human and orangutan are not close relatives. To do this we must answer the question, “If human and orangutan were not a true group on this phylogeny and we randomly sampled 16,767 sites, what is the probability that we would obtain a data set that yields a score difference of at least 334 steps in favor of an incorrect tree?” This is not a trivial question to answer. Phylogenetic trees are difficult parameters to deal with, and we do not have the convenience of calculating a simple number for a tree and looking it up in a standard statistical table; nevertheless, we can still apply the core insights of statistical testing and identify groupings in aphylogenetic estimate that are weakly supported and likely to be overturned by subsequent analyses. If we think our data are very “clean” (unlikely to generate patterns that support spurious groupings), then it seems very unlikely that we would see a score difference of 334 entirely from sampling error. If our data appear to have lots of homoplasy, we might see this much-erroneous signal. Fortunately, our actual data set gives a hint about how “messy” the signal was. We can look at the number of sites that supported different groupings in the original data set to see how variable the score would be as a result of resampling. Alternatively, we could use a computer to simulate the effect of sampling error by generating many artificial data sets. By counting the proportion of simulated data sets that display at least 334 steps of support for spurious groupings, we can approximate a P value.

Bootstrapping is the most common way to assess the effect of sampling error on tree inference. In bootstrapping, we create many “pseudoreplicates” of our original data by randomly sampling from the pool of characters we observed. In each pseudoreplicate, a different set of characters will be overrepresented and another set will be excluded. By conducting phylogenetic analysis on each of these pseudoreplicates, we can discover which groupings in the tree are sensitive to sampling error. The bootstrap proportion for a group is the proportion of pseudoreplicate analyses that supported the group. Bootstrapping is computationally demanding, because hundreds of tree searches must be performed, and it does not directly yield a P value; however, it does provide a useful summary of clades that are well supported.

7. BAYESIAN INFERENCE

Bayesian statistics is a major branch of statistical theory, and it has had a large impact on phylogenetic inference. Bayesian approaches formalize the ways we can use data to update our beliefs about the world. We start by considering all parameter values, assigning each parameter value a prior probability. This prior probability represents our degree of belief before examining the data. For instance, in the case of the primate tree, a person who had never studied anthropology or mammalogy might be completely uncertain about whether the correct grouping for the Old World primates would be human + orangutan, human + baboon, or orangutan + baboon. In such a case, the person might assign each tree a prior probability of 1/3 to reflect the fact that he thinks each scenario is equally likely (and the sum of probabilities must be 1). Technically speaking, this prior probability for the tree topology is actually an integral of probability densities over all the possible branch-length combinations for the tree. Before looking at the data, few people would be confident about specifying a reasonable branch length, but some combinations might seem implausible. For example, we might be surprised if the branch leading to human were thousands of times longer than the branch leading to its closest relative; by assigning such combinations low prior probability, one is able to bring previously learned insights to bear on an analysis. There is an element of subjectivity or arbitrariness in prior specification.

The next step is to update prior beliefs into posterior probability statements. This step is not subjective at all; in fact, it is simply an exercise in applying the rules of probability. Specifically, Bayes’ theorem states that the probability associated with a parameter value in light of the data is proportional to the parameter value’s prior probability multiplied by the likelihood of that parameter value; thus, Bayesian inference is closely tied to exactly the same likelihood function that forms the basis of ML inference.

One very attractive feature of Bayesian inference is its ability to produce a single-best estimate of a parameter (for instance, the phylogeny with the highest posterior probability), but also an easily interpreted statement of support. If the posterior probability for a tree or a clade is close to 1, then those aspects of evolutionary history are strongly supported. If the tree with the highest posterior probability has a probability of 0.4, then we immediately know it is a questionable inference.

The result of a Bayesian analysis is a summary that blends any prior knowledge with information from the data. In practice, we often have only vague knowledge of the model before seeing the data, so the effect of the likelihood dominates the inference. In such cases, the results of Bayesian point estimation are usually similar to ML point estimates. If we do have strong biological knowledge about some aspect of character evolution, then we can express that as a prior probability statement and use that information in a Bayesian framework.

Because Bayesian techniques aim to describe the probability of all possible parameter values, the computational tools required differ considerably from the tree-searching tools used to find parsimony or ML tree estimates. In the vast majority of cases, Bayesian phylogenetic inference is conducted by a computer-simulated walk through the space of all parameter values. We can design the rules for the simulation in a very specific way so the walk tends to avoid parameter values with low likelihood (or lower prior probability). Running these simulations for a large number of iterations provides a set of parameters that are sampled in proportion to their posterior probability. This Markov chain Monte Carlo simulation approach is an elegant solution to the difficulties of exploring a large parameter space, but it requires considerable care to ensure it provides reliable results.