One might start with sequences A and B (in practice, the sequences are usually chosen randomly):
Then the next sequence is added to the tree such that the distances between A, B, and C are approximately equal to the evolutionary distances:
Note that the fit is not perfect. If we could determine the evolutionary distances exactly, they would fit the tree exactly, but since we have to estimate these distances, the numbers are fit to the tree as closely as possible using averaging or least-squares best fit.
The next step is to add the next sequence, again readjusting the tree to fit the distances as well as possible:
Finally, we can add the last sequence and readjust the branch lengths one more time using least squares:
Neighbor joining and Fitch are both least-squares distance-matrix methods; the main difference is that neighbor joining determines the tree structure separately from solving the branch lengths, whereas Fitch solves both together by adding branches (sequences) one at a time.
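To make the branch-length fitting concrete, here is a minimal sketch (not from the text, using made-up distance values) of the three-sequence case in Python: for sequences A, B, and C joined at a single internal node, the three pairwise evolutionary distances determine the three branch lengths exactly, and it is only when more sequences are added that the fit becomes approximate and least squares is needed.

# Hypothetical estimated evolutionary distances between three sequences
d_AB, d_AC, d_BC = 0.30, 0.45, 0.41

# Solve d_AB = a + b, d_AC = a + c, d_BC = b + c for the branch lengths a, b, and c
a = (d_AB + d_AC - d_BC) / 2
b = (d_AB + d_BC - d_AC) / 2
c = (d_AC + d_BC - d_AB) / 2

print(f"branch to A: {a:.3f}, branch to B: {b:.3f}, branch to C: {c:.3f}")

# With only three sequences the fit is exact; a fourth sequence brings six pairwise
# distances but only five branch lengths, so the distances can no longer all be
# matched exactly and the lengths are instead chosen to minimize the squared error.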
Parsimony
Parsimony is actually a collection of related methods based on the premise of Occam’s razor. In other words, the tree that requires the smallest number of sequence changes to generate the sequences in the alignment is the most likely tree.
For example:
No distance matrix is calculated; instead, candidate trees are searched and the ancestral sequence at each node is inferred, allowing for all uncertainties, in a process analogous to solving a Sudoku puzzle. The number of “mutations” required is tallied, and the tree with the best score wins. Testing every possible tree is not usually feasible (the number of trees grows exponentially with the number of sequences), so a variety of search algorithms are used to examine only the most likely trees. Likewise, there are a variety of ways of counting (scoring) sequence changes.
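As an illustration of how the required changes can be counted, the following sketch (not from the text; the tree and bases are hypothetical) scores one alignment column on one candidate tree by working up from the tips and adding a change whenever the two sides of a node cannot agree on a base:

def count_changes(node, column):
    # Return (possible ancestral bases, minimum changes) for a subtree
    if isinstance(node, str):                    # a leaf: its base is known
        return {column[node]}, 0
    left, right = node                           # an internal node with two children
    left_bases, left_changes = count_changes(left, column)
    right_bases, right_changes = count_changes(right, column)
    shared = left_bases & right_bases
    if shared:                                   # the two sides can agree: no new change
        return shared, left_changes + right_changes
    return left_bases | right_bases, left_changes + right_changes + 1

# Hypothetical tree ((A,B),(C,D)) and the base each sequence has at one column
tree = (("A", "B"), ("C", "D"))
column = {"A": "G", "B": "G", "C": "A", "D": "G"}

print(count_changes(tree, column)[1])            # minimum changes for this column: 1

In a real analysis this count is summed over every column of the alignment and repeated for every candidate tree, and the tree with the lowest total score wins.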
Parsimony methods are typically slower than distance-matrix methods but very much faster than the maximum-likelihood methods described below. Parsimony uses more of the information in an alignment, since it does not reduce all of the individual sequence differences to a distance matrix, but it seems to work best with relatively closely related sequences and is not usually used for rRNA sequences.
Maximum likelihood
The maximum-likelihood method turns the tree construction process on its head, starting with a cluster analysis to generate a “guide” tree, from which a very complete substitution model is calculated. The algorithm then goes back and calculates the likelihood of any particular tree by summing the probabilities of all of the possible intermediates required to get to the observed sequences. Rather than try to calculate this for all possible trees, a heuristic search is used starting with the guide tree. Sound complicated? It is, and maximum-likelihood tree construction is by far the most computationally intensive of the methods in common use. However, it is generally also the best, in the sense that the trees are more consistent and robust. The limitation is that fewer and shorter sequences can be analyzed by the maximum-likelihood method because of its computational demands. A tree that might take a few seconds by neighbor joining or a few minutes by parsimony or Fitch can take a few hours or a couple of days by maximum likelihood. This is serious; it means that you cannot usually “play” with trees, testing various changes in the data or treeing parameters and seeing the result immediately.
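The following sketch (not from the text) shows the core of the likelihood calculation for a single alignment column on the simplest possible tree, two sequences joined through an unobserved ancestor: the probabilities of every possible ancestral base are summed, using the Jukes-Cantor substitution model and illustrative branch lengths as assumptions.

import math

def jc_prob(from_base, to_base, branch_length):
    # Jukes-Cantor probability that a base does (or does not) change over a branch
    e = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * e if from_base == to_base else 0.25 - 0.25 * e

def column_likelihood(base_a, base_b, t_a, t_b):
    total = 0.0
    for ancestor in "ACGT":                      # sum over every possible ancestral base
        total += 0.25 * jc_prob(ancestor, base_a, t_a) * jc_prob(ancestor, base_b, t_b)
    return total

# Likelihood of seeing 'A' in one sequence and 'G' in the other, given branch lengths
# of 0.1 and 0.3 substitutions per site; the likelihood of a whole tree is the product
# of such values over all columns, which is why the method is so computationally costly.
print(column_likelihood("A", "G", 0.1, 0.3))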
Bayesian inference
Bayesian inference is a relatively new approach to tree construction. This approach starts with a random tree structure, random branch lengths, and random substitution parameters for an alignment, and the probability of the tree being generated from the alignment with these parameters is scored. Obviously, the initial score is likely to be very poor. Then a random change is made in this tree (branch order, branch length, or substitution parameter) and the result is rescored. A choice is then made whether to accept the change; this choice is partially random, but the greater the improvement in tree score, the more likely the change is to be accepted. If the change is accepted, the process is repeated starting with this new tree; if the change is rejected, the process is repeated starting with the old tree. After many, many cycles of this process, the algorithm settles into a collection of trees that are nearly optimal. Various tricks are used to keep the algorithm from getting stuck at local scoring minima.
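A minimal sketch, not from the text, of the accept-or-reject rule just described: a change that improves the score is always accepted, and a change that worsens it is accepted only occasionally, with a probability that shrinks as the score gets worse (the scores below are hypothetical log-likelihoods).

import math, random

def accept_change(old_log_score, new_log_score):
    # Always accept an improvement; accept a worse tree only part of the time
    if new_log_score >= old_log_score:
        return True
    return random.random() < math.exp(new_log_score - old_log_score)

# A proposed rearrangement drops the log score from -1000.0 to -1002.3; it is still
# accepted now and then, which helps keep the search from getting stuck at a locally
# optimal tree.
print(accept_change(-1000.0, -1002.3))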
Bootstrapping
Unfortunately, it is mathematically impossible to extract confidence intervals (χ² or p values) for nodes in a tree generated using the standard treeing methods; in other words, we cannot determine how sure we should be of the placement of each node in a tree (commonly referred to as branching order). This is not true of branch lengths, for which confidence intervals can be readily calculated. The standard method of evaluating tree branching-order reliability is called bootstrapping.
In a bootstrap analysis, the columns of a sequence alignment are randomly sampled, with replacement, to create a jumbled, resampled alignment of the same length as the original. Typically 100 or 1,000 such alignments are generated from the initial alignment, and a tree is built from each. The reliability of a particular branching arrangement is judged by how frequently that branch appears among the resulting trees.
The random sampling starts with the input alignment. In the standard tree-generation process, the similarity matrix is generated by checking sequences against each other pairwise, tallying the similarity at each position in the alignment; each position in the alignment is therefore counted once and only once. In a bootstrap sampling, the similarity matrix is instead generated by comparing randomly selected positions in the alignment, so that some positions are compared more than once and some are not compared at all. For example, the starting alignment would be randomly sampled, ending up with a bootstrapped alignment that might look like this:
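A minimal sketch, not from the text, of how one such bootstrapped alignment can be produced: columns are drawn at random, with replacement, until the new alignment is as long as the original, so some columns appear more than once and others not at all (the toy alignment is hypothetical).

import random

alignment = {
    "seq1": "GATTACA",
    "seq2": "GACTACA",
    "seq3": "GATTGCA",
}

def bootstrap_replicate(aln):
    length = len(next(iter(aln.values())))
    columns = [random.randrange(length) for _ in range(length)]   # sample with replacement
    return {name: "".join(seq[i] for i in columns) for name, seq in aln.items()}

for name, seq in bootstrap_replicate(alignment).items():
    print(name, seq)

# In practice 100 or 1,000 such replicates are generated, a tree is built from each, and
# the percentage of replicates in which a given branch appears becomes its bootstrap value.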
Figure 5.5 Example of a published phylogenetic tree with bootstrap values included. In this case, this is a maximum-likelihood tree, with bootstrap values from both maximum-likelihood (above the branches) and maximum-parsimony (below the branches) analyses shown. (Reprinted from Barns SM, Fundyga RE, Jeffries MW, Pace NR, Proc