Principles of Microbial Diversity. James W. Brown

for transitions) or determined by presifting the alignments to count the observed ratios of differences. doi:10.1128/9781555818517.ch5.f5.1

      In the Jukes and Cantor method, any difference in two sequences is scored equivalently; for each position in a pairwise comparison, the bases are either a match or they are not. A commonly used alternative is the Kimura two-parameter model, in which transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or pyrimidine to purine) are scored differently because transitions are much more common than transversions (Fig. 5.1). These scores are based on presifting the alignment to determine the relative frequency of transitions to transversions, and these different types of changes are scored accordingly. It is even possible to have a six-parameter model, in which each type of substitution (G:A, G:C, G:U, A:U, A:C, and U:C) is scored differently (Fig. 5.2).
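
      The difference between the two scoring schemes can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used by any particular treeing package: the function names are hypothetical, the formulas are the standard published Jukes-Cantor and Kimura two-parameter corrections, and columns containing gaps or ambiguous bases are simply skipped (an assumption made here for brevity).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}
BASES = PURINES | PYRIMIDINES


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences.

    T is treated as U. Columns with gaps or ambiguous bases are skipped
    (an assumption of this sketch).
    """
    s1 = seq1.upper().replace("T", "U")
    s2 = seq2.upper().replace("T", "U")
    sites = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in BASES or b not in BASES:
            continue
        sites += 1
        if a == b:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine)
        # means a transition; otherwise it is a transversion.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: every kind of difference is scored the same."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance: transitions and transversions differ."""
    sites, ts, tv = count_differences(seq1, seq2)
    P = ts / sites  # observed fraction of transitions
    Q = tv / sites  # observed fraction of transversions
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

      For two sequences that differ by the same total number of positions, `jukes_cantor` returns the same distance regardless of which bases changed, whereas `kimura_2p` depends on how the differences split between transitions and transversions.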

      It is also possible to “weigh” the score of each position (column) in an alignment differently based on how conserved that position is; a difference in a conserved position is then scored as a greater difference than a difference in a more variable position. This requires alignments with many sequences so that variability at each position can be measured reliably, so the weights are very often predetermined for the class of RNA being analyzed. The Weighbor algorithm used by the Ribosomal Database Project does this; the name stands for “weighted neighbor joining.” Distance matrices from protein alignments usually use a scoring table, e.g., the PAM tables, derived from the observed relative frequency with which each amino acid is substituted by another in a huge collection of aligned protein sequences.
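
      The idea of position weighting can be illustrated with a small sketch. The weighting scheme below (the frequency of the most common residue in each column) is an assumption chosen for simplicity; it is not the Weighbor algorithm, and the function names are hypothetical.

```python
from collections import Counter


def column_weights(alignment):
    """Weight each column by the frequency of its most common residue.

    Fully conserved columns get weight 1.0; more variable columns get less.
    This particular weighting is an illustrative assumption.
    """
    n = len(alignment)
    weights = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        weights.append(most_common_count / n)
    return weights


def weighted_distance(seq1, seq2, weights):
    """Weighted fraction of differing positions between two aligned sequences."""
    total = sum(weights)
    diff = sum(w for a, b, w in zip(seq1, seq2, weights) if a != b)
    return diff / total


alignment = ["ACGU", "ACGA", "ACGC", "GCGU"]
weights = column_weights(alignment)
# Sequences 0 and 3 differ at a mostly conserved column;
# sequences 0 and 1 differ at a highly variable column.
d_conserved = weighted_distance(alignment[0], alignment[3], weights)
d_variable = weighted_distance(alignment[0], alignment[1], weights)
```

      In this toy alignment each pair differs at exactly one position, but the difference at the more conserved column yields the larger distance. With only four sequences the weights are unreliable, which is why real weights are measured from large alignments.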


      Figure 5.2 A six-parameter substitution model scores each possible substitution differently. doi:10.1128/9781555818517.ch5.f5.2

      It is also difficult to deal with the fact that adjacent gaps are not independent. A string of gaps probably represents the insertion or deletion of more than one base at the same time, not the one-at-a-time insertion/deletion of individual bases. For example, a five-base string of gaps most likely represents a single insertion/deletion of five nucleotides, not five independent insertions/deletions of single nucleotides. Sophisticated algorithms use a large scoring penalty for a single gap but then only a very small additional penalty for additional adjacent gaps.
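
      This kind of scoring is often called an affine gap penalty. A minimal sketch follows; the function name and the penalty values are arbitrary assumptions for illustration, and the sketch does not distinguish indels from the missing data of partial sequences.

```python
import re


def affine_gap_penalty(aligned_seq, gap_open=10.0, gap_extend=0.5, gap_char="-"):
    """Total gap penalty for one aligned sequence.

    Each run of adjacent gaps costs gap_open for its first position and
    gap_extend for every additional adjacent gap. The default values are
    arbitrary and chosen only for illustration.
    """
    penalty = 0.0
    for run in re.finditer(re.escape(gap_char) + "+", aligned_seq):
        length = run.end() - run.start()
        penalty += gap_open + gap_extend * (length - 1)
    return penalty


one_long_gap = affine_gap_penalty("ACG-----UAC")     # one run of five gaps
five_short_gaps = affine_gap_penalty("A-C-G-U-A-C")  # five separate gaps
```

      With these values, a single five-base gap is penalized far less than five separate single-base gaps, reflecting the assumption that the long gap arose from one insertion/deletion event.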


      Figure 5.3 An alignment showing two fundamentally different types of gaps. All of the gaps in the upper half of the sequences are indels; at least one sequence in the database has nucleotides at these positions, but these sequences do not. Some of the sequences in the bottom half of the alignment are partial sequences, i.e., sequence fragments that use gaps wherever there are no sequence data. doi:10.1128/9781555818517.ch5.f5.3

      The special case of GC bias

      Sometimes even rRNA sequences change adaptively—the bane of phylogenetic analysis. The most common example is the tendency of sequences to differ in G+C content, either because the genome has an unusual G+C content (i.e., there is pressure toward either G+C or A+T richness in the genome) or because the organism is a thermophile and so might prefer G=C over A=U base pairs in its RNAs. This can cause havoc in a tree. One way around this is to do a transversion analysis, which ignores transitions and only scores transversions. The common way to do this is simply to convert all of the A’s in the alignment to G’s and all U’s to C’s. Trees are generated from these alignments in the usual fashion. These trees are, of course, based on fewer data since more than half of the phylogenetic information in the alignment has been discarded, but they should be free of G+C bias artifacts.
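
      The recoding step described above is simple enough to sketch directly. The function name is hypothetical, and the sketch assumes RNA sequences written with A, C, G, and U; gaps and any other characters are left unchanged.

```python
# Map each purine to G and each pyrimidine to C, so that transitions
# disappear and only transversions remain visible as differences.
RECODE_TABLE = str.maketrans({"A": "G", "U": "C", "a": "g", "u": "c"})


def transversion_recode(seq):
    """Convert all A's to G's and all U's to C's in an aligned sequence."""
    return seq.translate(RECODE_TABLE)


alignment = ["GGAU-CA", "AAGC-CA", "GGAU-GA"]
recoded = [transversion_recode(s) for s in alignment]
```

      After recoding, the first two sequences (which differ only by transitions) become identical, while the third still differs from them at the one position where it has a transversion.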

      Long-branch attraction

      One of the things substitution models fight is a treeing artifact called long-branch attraction. Long-branch attraction is the result (primarily) of an underestimation of the evolutionary distance of distantly related sequences. This underestimation results in a tendency for the longest branches in a tree to artificially cluster together; this also results in the artificial clustering of short branches. Figure 5.4 shows a very simple demonstration of how long-branch attraction can result in incorrect trees.

      Long-branch attraction happens because of the difference in evolutionary rates in the branches. Therefore, it is always worth worrying about the details of trees containing branches with very different evolutionary rates, i.e., those with branches of very different lengths.
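
      The underestimation of large distances can be illustrated numerically. The sketch below assumes the Jukes-Cantor model and uses its standard relationship between the true distance (substitutions per site) and the expected fraction of sites observed to differ; the function name is hypothetical.

```python
import math


def expected_observed_difference(true_distance):
    """Expected fraction of differing sites under the Jukes-Cantor model.

    Multiple substitutions at the same site hide earlier changes, so the
    observed difference grows more slowly than the true distance.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * true_distance / 3.0))


for d in (0.1, 0.5, 1.0, 2.0):
    p = expected_observed_difference(d)
    print(f"true distance {d:.1f} -> observed difference {p:.3f}")
```

      For short distances the observed difference is close to the true distance, but at a true distance of 1.0 substitution per site only about 0.55 of the sites are expected to differ, and the observed difference can never exceed 0.75. The longer the branch, the more its length is underestimated unless the substitution model corrects for it.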


      Figure 5.4 Generation of a “long-branch attraction” artifact in a phylogenetic tree. If the sub-tree to the left is the representation of how these sequences are actually related, imagine what would happen in a neighbor-joining analysis. Sequences A and B are more alike (i.e., they have a smaller evolutionary distance between them) than either is to C, and so they will be erroneously joined, as shown on the right. doi:10.1128/9781555818517.ch5.f5.4

      Fitch-Margoliash: an alternative distance-matrix treeing method

      Another useful method for generating trees from distance matrices is that of Fitch and Margoliash, commonly called Fitch. This algorithm starts with two of the sequences, separated by a line whose length equals the evolutionary distance between them. For example, for this distance matrix:


Evolutionary distance
A B C D E
A