Principles of Microbial Diversity. James W. Brown
As shown graphically in Fig. 4.1, similarity and distance are very closely related initially (e.g., 0.90 similarity ≈ 0.10 distance) but level off to 0.25 similarity, where evolutionary distance is infinite. This makes sense; for two sequences that are very similar, the probable frequency of more than one change at a single site is low, requiring only a small correction, whereas two sequences that have changed beyond all recognition (infinite evolutionary distance) are still approximately 25% similar just because there are only four bases and so approximately one of the four will match entirely by chance.
Figure 4.1 The Jukes and Cantor equation plotted as observed sequence similarity (from the similarity matrix) versus estimated evolutionary distance. doi:10.1128/9781555818517.ch4.f4.1
To convert a similarity matrix to a distance matrix, just convert each value in the similarity matrix to evolutionary distance using either the graph or the equation. In our example:
Generating a tree from a distance matrix
In the neighbor-joining method, the structure of the tree is determined first and then the branch lengths are fit to this skeleton.
Solving the tree structure
The tree starts out with a single internal node and a branch out to each sequence: an n-pointed star, where n is the number of sequences in the alignment. The pair of sequences with the smallest evolutionary distance separating them is joined onto a single branch (i.e., the neighbors are joined, hence the name of the method), and then the process is repeated after merging these two sequences in the distance matrix by averaging their distances from every other sequence in the matrix.
Using our distance matrix, the tree starts out like this (remember that we are sorting out the structure of the tree, not yet the branch lengths).
The closest neighbors in the distance matrix are A and B (0.11 evolutionary distance), so these branches are joined:
The distances to A and B from each of the other sequences are then averaged to reduce the distance matrix:
In this case, the averages are trivial; the average of 0.30 and 0.30 is, of course, 0.30, and so forth. This is not usually the case. Then the process is repeated. The closest neighbors in the reduced matrix are D with C (0.17):
Once again, the distance matrix is reduced by averaging. But be sure to average from the original distance matrix, not the previously reduced matrix:
Note that the value listed for A/B with C/D (0.265) is the average of four values in the original matrix: A to C (0.30), A to D (0.23), B to C (0.30), and B to D (0.23). If averaged AB/C (0.30) and AB/D (0.23) were averaged from the reduced matrix, the same number would be obtained in this case, but usually this is a more complex average and the numbers do not come out the same.
Once again, the smallest number in the matrix represents the nearest neighbors, in this case A/B with C/D (0.265), so these two branches are joined:
Each node on this tree has only three branches connecting to it; all of the nodes are completely resolved. This means that the structure of the tree has been determined. If there were more sequences, it would be necessary to reduce the matrix (joining A/B with C/D) and repeat the process until all of the nodes were resolved. The internal nodes have been arbitrarily labeled w, x, y, and z for reference when sorting out branch lengths (below).
Determining branch length
The next step is to determine the lengths of the branches on this tree. Basically, this is done by going through each node and finding where along the branch it is by figuring out the average difference in length along each of two branches. By choosing various sets of three sequences in a tree, the branch lengths can be sorted out just like a puzzle.
Branches w/A and w/B (w is the common node between A and B)
In our example, the distance between A and B is 0.11, and so the lengths of the two branches connecting them (A/w and w/B) must add up to 0.11. But where along this branch is the node (w)? If you look at the distance from any other sequence (C, D, or F) to A and to B, it is always the same. For example, C/A is 0.30 and so is C/B. This means that node w must be midway between A and B; each branch, then, has a length of 0.055. For example, with C used as reference:
Branches x/C and x/D
C and D are also simple neighbors, so we can easily solve these two connecting branches as well. The distance between C and D is 0.17. However, the distance to either C or D from elsewhere in the tree differs; this means that the node connecting C and D is not equidistant between them. In fact, this difference in distance between C and D varies a bit depending on which reference we use. These differences occur because we can only estimate evolutionary distance. As a result, we use the average (although most computer algorithms would use a least-squares average):
Therefore, 0.07 is the difference in branch length between A/C and A/D
This value, 0.08, is the average amount by which x/C is longer than x/D. Because the total length of C/D (think of this as C/x/D) is 0.170 and x/C is 0.08 longer than x/D, then:
x/C is the same 0.045 plus the 0.08 by which we already determined it was longer = 0.125:
Branches z/E and z/F
The distance between E and F can be calculated in the same way:
The length of E/z/F (E/F) is 0.30, and the average difference between anywhere else in the tree and E or F is −0.08. The negative sign means that the branch to F is longer than the branch to E. So z/E = (0.30 − 0.08)/2 = 0.11, and z/F − 0.11 + 0.08 = 0.1.
Internal branches w/x, x/y, and x/z
The tree so far is: