phylogeny.reconstruction package

Submodules

phylogeny.reconstruction.allquartets module

All quartets method.

Given an n×n additive matrix M with n ≥ 5 associated to a binary tree T with positive branch lengths, we can construct T using a two-step technique that we now describe.

In Step 1, we compute a quartet tree on every four leaves by applying the Four Point Method to each 4×4 submatrix of M.

In Step 2, we assemble the quartet trees into a tree on the full set of leaves. Step 1 is straightforward. The technique we use in Step 2 is called the “All Quartets Method”.

phylogeny.reconstruction.allquartets.all_quartets(dist_matrix, names=None)[source]

Get all inferred quartet subtrees.

phylogeny.reconstruction.allquartets.all_quartets_method(dist_matrix, names=None)[source]

Reconstruct the tree from the dist. matrix using the all quartets method.

phylogeny.reconstruction.allquartets.four_point_method(additive, names=None)[source]

Method for inferring a tree from a 4x4 additive matrix.

If we are given a 4×4 additive matrix ‘D’ that corresponds to a tree ‘T’ with positive branch weights, then we can easily compute ‘T’ from ‘D’: We calculate the three pairwise sums from the four point condition, we determine which of the three pairwise sums is the smallest, and use that one to define the split for the four leaves into two sets of two leaves each.

phylogeny.reconstruction.allquartets.induced_quartet(dist_matrix, idx_quartet=None)[source]

Get the induced quartet ordering of 4 items.

phylogeny.reconstruction.allquartets.infer_siblings(quartets)[source]

From the tree quartets, infer pairs of sibling leafs.

We search for a pair x,y of leaves that is always together in any quartet that contains both x and y. (In other words, for all a,b, any quartet on {x,y,a,b} is ((x,y),(a,b))). Any pair of leaves that are siblings in the quartets tree T will satisfy this property.

phylogeny.reconstruction.allquartets.map_names_to_quartet(quartet, names=None)[source]

Map the names to the quartet’s indices.

phylogeny.reconstruction.allquartets.tree_from_quartets(quartets)[source]

From the given quartets, assemble the tree.

phylogeny.reconstruction.clocklike1 module

This module presents an algorithm for reconstructing the ultrametric tree corresponding to a matrix of ultrametric distances.

The algorithm is the one presented by Dan Gusfield in:

Additional information can be found on his book:

Algorithms on Strings, Trees, and Sequences.

Usage:

# An ultrametric matrix
>>> ultrametric = [ [0, 8, 8, 5, 3],
                    [8, 0, 3, 8, 8],
                    [8, 3, 0, 8, 8],
                    [5, 8, 8, 0, 5],
                    [3, 8, 8, 5, 0] ]
>>> nodes = ['A', 'B', 'C', 'D', 'E']

# Get the tree
>>> t = infer_clocklike_tree1(ultrametric, nodes)
>>> print(t)

             /-A
          /-|
       /-|   \-E
      |  |
    --|   \-D
      |
      |   /-B
       \-|
          \-C

Algorithm description:

“…here is another combinatorial algorithm that I claim is correct

and that does run in O(n^2) time. The algorithm is described in terms of a graph G, based on matrix D, but it can be implemented without an explicit graph.

Let each row i of matrix D be represented by a node i in G, and each edge (i,j) be given the value D(i,j). In O(n2) time, the algorithm will find a very particular path in graph G:

Set N equal to all the indices 1 through n; set L to the empty path; set i to any node.

Repeat n-1 times: begin remove i from N; find an index j in N such that D(i,j) <= D(i,k) for any k in N. place edge (i,j) in the path L; set i to j; end;

What this produces is a path L of exactly n edges, and the algorithm can be implemented in O(n2) time. It turns out that L is a minimum spanning tree of G, but that fact is not needed.

We will now use L to create the ultrametric tree recursively.

Concentrate on an edge (p,q) in the path L with the largest edge weight of all edges in L, and let P be the set of nodes at or to the left of p in L, and let Q be the set of nodes at or to the right of q in L. The fact that D is an ultrametric matrix implies that for any pair of nodes (i,j) where i is in P and j is in Q, D(i,j) = D(p,q). One way to prove this is by induction on the number of edges between i and j in L (applying the ultrametric condition that in any triangle, the max of the three edge weights is not unique). What this means is that in the ultrametric tree we are building (and in any ultrametric tree for D), any pair of leaves (i,j) where i is in P and j is in Q must have their least common ancestor at the root of the ultrametric tree, and that root must be labelled D(p,q).

If there are k > 1 ties for the global max edge weight in L, then removing those k edges creates k+1 subpaths of nodes, and applying the above argument, any two nodes i and j which are in different subpaths must have their least common ancestor at the root of the tree, which again must be labeled D(p,q). Hence, any ultrametric tree T for D must have exactly k+1 edges out of D, and the leaf set below any such edge must be exactly the (distinct) set of nodes in one of the k+1 subpaths.

No matter what k is, removing the k max weight edges in L, and partitioning N, takes only O(n) time.

To continue the description of the algorithm, we assume for convenience that k = 1. Let LP and LQ denote the two subpaths created by removing the max weight edge in L. Now we want to find an ultrametric tree for set P and one for set Q; these two ultrametric trees will then be attached to the root to creat the full ultrametric tree for D. But note that we already have the needed paths LP and LQ that would be created if we were to recursively apply the above method (clearly LP could result if we applied the path building algorithm to P alone, and similarly for LQ and Q). So we only need to find the max weight edge(s) in LP and the max weight edge(s) in LQ. Those two edges can be found in O(n) total time. Again, because the nodes were partitioned in the first step, this time bound holds even for k > 1.

Continuing, we build the ultrametric tree in O(n2) total time.

Note that at each step of the algorithm, the node partitions that are created, and the associated edges that are put into T, are forced. Hence if D is an ultrametric matrix, the ultrametric tree T for D is unique.

” - Dan Gusfield.

phylogeny.reconstruction.clocklike1.draw_graph(g)[source]

Display an edge-weighted graph.

phylogeny.reconstruction.clocklike1.get_graph(ultrametric, nodes)[source]

From an ultrametric matrix, get the weights graph.

phylogeny.reconstruction.clocklike1.get_path(g, ultrametric, nodes)[source]

From the weights graph, get the path.

phylogeny.reconstruction.clocklike1.infer_clocklike_tree1(ultrametric, node_names=None)[source]
phylogeny.reconstruction.clocklike1.max_edge_weight(L)[source]

Return the edge with the largest weight.

phylogeny.reconstruction.clocklike1.path_to_tree(P)[source]
phylogeny.reconstruction.clocklike1.weights_for(matrix, nodes)[source]

Get the entries by node name.

phylogeny.reconstruction.clocklike2 module

Module implementing phylogeny estimation assuming clocklike evolution.

“An assumption that is sometimes made is that sequence evolution is clocklike (also referred to as obeying the strict molecular clock), which means that the expected number of changes is proportional to time. If we assume that the leaves represent extant (i.e., living) species, then under the assumption of a strict molecular clock, the total expected number of changes from the root to any leaf is the same. Under the assumption of a strict molecular clock, the matrix of expected distances between the leaves in the tree has properties that make it “ultrametric”.”

– From the book: “Computational Phylogenetics. An introduction
to designing methods for phylogeny estimation” by Tandy Warnow
phylogeny.reconstruction.clocklike2.infer_clocklike_tree2(distances)[source]

Assumming the sequences evolved in a clocklike process, infer the tree.

“One very natural approach to estimating the tree would be to select as siblings the two sequences that are the most similar to each other from the three possible pairs. Because the sequence evolution model is clocklike, this technique will correctly construct rooted three-leaf trees with high probability. Furthermore, the method can even be extended to work on trees with more than three leaves, using recursion…

Hence, to estimate this tree, we would first compare all pairs of sequences to find which pair is the most similar, and we’d select ‘a’ and ‘b’ as this pair. We’d then correctly infer that species ‘a’ and ‘b’ are siblings. We could then remove one of these two sequences (say ‘a’), and reconstruct the tree on what remains. Finally, we would add ‘a’ into the tree we construct with the other vertices, by making it a sibling to ‘b’.”

– From the book.

Module contents