An introduction to maximum parsimony and compatibility
Trevor Bruen
PhD Candidate, McGill Centre for Bioinformatics

Overview
• The point of this talk is to give a sense of how discrete mathematics enters into phylogenetic and genetic inference.
• I will illustrate these ideas by describing two approaches in detail, namely maximum compatibility and maximum parsimony.
• I will also show how ideas from these two criteria can be used to develop applications such as bounds and tests for recombination.
• My goal is to give a basis for further study in this area and greater insight into these methods.

Outline
• Introduction to compatibility and parsimony
• Overview of basic notation/concepts
• Compatibility
• Compatibility as a graph theory problem
• Compatibility for pairs of characters
• Interpretation of compatibility
• Parsimony
• Parsimony score with connections to graph theory
• Connections between parsimony and compatibility
• Homoplasy
• Parsimony for pairs of characters
• Connections between SPRs/TBRs and parsimony
• Applications to recombination
• Parsimony as a consensus method

Introduction
• Maximum parsimony and maximum compatibility are criteria used in phylogenetics, linguistics and population genetics.
• Phylogenetics: the goal is to infer an evolutionary tree.
• Linguistics: often the same.
• Population genetics: uses compatibility for recombination.
• For general phylogenetic inference with molecular data, likelihood (probability-based) methods are generally preferred.
• BUT compatibility and parsimony are computationally tractable.
• ALSO the mathematics behind parsimony and compatibility is very well developed. We can show that parsimony = likelihood in certain circumstances (Tuffley and Steel 1997). This gives us insight into where to go in terms of research.

Formalism
• A character is a mapping from a set of taxa to a set of states.
• In this case, X={S1,S2,S3,S4}
• Also, C={A,C}
• Informally, a character is a “column” in a multiple sequence alignment.

Binary Character / Splits
• If a character has two states then it induces a split of the taxa set.
• Example: Let X be the taxa set {S1,S2,S3,S4}. Let C be the state set {A,C}.
• Then {S1,S2} | {S3,S4} is the split induced by the first character.
• In general a character induces a set of equivalence classes.

Tree and Labeling
• Informally, we would like to be able to describe a tree and a labeling structure mathematically.
• In graph theory a tree T=(V,E) is a connected graph with no cycles.
• Informally, we would also like to be able to attach taxa (members of X) to our tree (actually to its leaves).
• Define a labeling function phi : X -> V(T), such that the leaves of T are labeled by members of X.

X-Trees
• An X-tree consists of a pair (T, phi), where phi is a labeling function that labels the leaves of T.
• Recall: phi : X -> V(T).

Extensions
• Informally, we have an X-tree consisting of the pair (T, phi). We also have a character chi. We need to relate the character to the tree.
• Define an extension of the character chi as a function from V(T) to C that is consistent with chi at the leaves, i.e. the leaf labeled by taxon x is assigned the state chi(x).
• Informally, an extension provides a description of how the internal vertices are labeled.

Quick Summary
• Summary so far:
• X-trees are trees along with functions labeling the leaves with members of X.
• A character is a function from X into a state set C.
• An extension is a labeling of the vertices of T with states of C.

Compatibility - Definition
• A character is compatible with a tree if and only if there exists an extension of the character to the tree so that the subgraphs induced by each of the states are connected.
• Example:
• First tree: the character is compatible with the tree.
• Second tree: the character is incompatible since the two A’s are disconnected.

Compatibility
• Problem definition: Given a sequence of characters, determine whether there exists a tree on which all characters are compatible.
• Related problem: Given a sequence of characters, determine the largest set of characters that are compatible with some tree.

Intersection Graph
• Suppose we have a sequence of characters chi_1, ..., chi_k on X.
• Then each character induces a partition of X, i.e. the taxa are grouped into blocks according to the state they are assigned.
• Create a graph where the vertex set consists of all blocks of all the characters.
• There is an edge between two vertices if and only if the intersection of the two corresponding subsets is non-empty.

Intersection Graph
• Whether the sequence of characters is compatible can be determined directly from the intersection graph.
• First we need to define two concepts: a chordal graph and a restricted chordal completion of the intersection graph.

Chordal Graphs
• A graph G=(V,E) is a chordal graph if every cycle with at least four vertices contains a chord (an edge connecting two non-consecutive vertices of the cycle).
• A chordalization of a graph G=(V,E) is a graph G’=(V,E’), where E is a subset of E’, such that G’ is chordal.

Restricted Chordal Completions
• Imagine the vertices of our graph G=(V,E) are colored. Then a restricted chordalization of G is a chordalization G’=(V,E’) of G in which all edges of G’ connect vertices of different colors.

Restricted chordal completions
• A restricted chordal completion of the intersection graph is a chordalization where there is no edge between vertices that share the same character.
• In this case, the “colors” correspond to characters.

Main Theorem for Compatibility
• Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of the intersection graph.

Pairs of Characters
• A simple corollary of the main theorem arises when we restrict our attention to two characters.
• Corollary: Two characters are compatible if and only if their intersection graph G is acyclic.
• Proof: (backward direction) If G is acyclic then it is chordal, so G is its own restricted chordal completion and the characters are compatible. (forward direction) Conversely, suppose G contains a cycle. Then any chordal completion of G must contain a three-cycle. But no restricted completion of G can contain a three-cycle, since with only two characters two of the three vertices would share a character. So if the characters are compatible, G is acyclic.

Interpretation
• Recall: a character is compatible with an X-tree if and only if there exists an extension of the character to the tree so that the subgraphs induced by each of the states are connected.
• Informally speaking, this is a very strict condition. It is an “all or nothing” condition - either a character is compatible with a tree or it isn’t. Relaxing this condition is the subject of the next section.

Parsimony
• Informally: given a leaf-labeled tree and a character, how can we define the fit of the character to the tree?
• Consider a character, along with an extension to a leaf-labeled tree. Then the length of the extension is the number of edges whose two endpoints are assigned different states by the extension.
• Define the parsimony score of a character on a tree as the length of a minimal extension of the character to the tree.

Parsimony
• Then the parsimony score for a set of characters on a tree is defined as the sum of the parsimony scores of the individual characters on that tree.
• The tree that minimizes this score is referred to as the maximum parsimony tree.

Parsimony and graph theory
• A minimal cut-set for a leaf-labeled tree T=(V,E) and a character chi is a minimal set of edges whose removal ensures that if chi(x) and chi(y) differ then x and y are in different components.
• Claim: There is a bijection between the set of minimal cut-sets and minimal extensions. So the cardinality of the minimal cut-set is equal to the parsimony score.

Parsimony and Graph Theory
• Recall Menger’s Theorem (1927): Let G=(V,E) be a graph with V1 and V2 as two disjoint subsets of V.
Then the minimum number of edges whose removal from G leaves vertices of V1 and V2 in different components is equal to the maximum number of edge-disjoint paths between V1 and V2.
• Corollary: For a binary character, the maximal number of edge-disjoint paths between leaves in the two states corresponds to the parsimony score.

Compatibility and parsimony
• Recall: Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of the intersection graph.
• Question: How can we characterize parsimony with respect to an intersection graph?

Compatibility Graph
• Recall: Each character induces a partition of X, i.e. the taxa are grouped by the state they are assigned.
• A block of a character is a maximal subset of taxa on which the character is constant.
• Thus we may identify the blocks of the characters with the vertices of the intersection graph.

Character Refinement
• A character chi’ refines another character chi if chi’(x) = chi’(y) implies chi(x) = chi(y) for all x, y in X.
• Thus characters that refine other characters correspond to refinements of the partition.

Compatibility and Parsimony
• Recall: Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of the intersection graph.
• Main:

Special Case: Two characters
• Recall: Two characters are compatible if and only if their intersection graph G is acyclic.
• Using the previous theorem we can show that the parsimony score for two characters can be computed directly from the intersection graph, in terms of quantities such as k, the number of components in the graph.
• Note: This score corresponds to the maximum parsimony score over all trees.

Homoplasy
• Recall: The parsimony score of a character on a tree corresponds to the minimum number of changes of the character on the tree.
• Informally: What is an intuitive way to think about the parsimony score?
• Define the homoplasy of a character on a tree as its parsimony score on the tree minus the smallest score it could have on any tree, which is the number of states of the character minus one.

Homoplasy
• Note that the homoplasy is zero if and only if the character is compatible with the tree.
• Informally: Homoplasy corresponds to the number of “extra” mutations of the character on the tree.
These “extra” mutations correspond to recurrent mutations.
• Informally: Thus a character is not compatible with a tree if and only if it cannot be placed on the tree without “extra” mutations.
• The parsimony score of a character on T is at least the number of states minus one, with equality if and only if the character is convex on T.

Homoplasy For Two Characters
• Recall: The parsimony score for a pair of characters can be found directly from the bipartite intersection graph.
• Recall: This score corresponds to an optimum over all trees.
• Thus for two characters, we can define a pairwise homoplasy score as the homoplasy of the pair on an optimal tree.
• Recall: Up to now homoplasy refers to “extra” mutations on a tree.

A second look at homoplasy
• Example: Two characters with a pairwise homoplasy score equal to one.
• Informally: We have seen that the homoplasy corresponds to the number of “extra” mutations on a tree.
• But in certain situations, this is biologically implausible. The state 1 may correspond to a mutation that has only arisen once. In this case, the fact that the pair of characters is incompatible can be explained by a recombination event.
• This will be defined more precisely later.

A quick aside - tree distances
• Differences between leaf-labeled trees can be defined using various metrics, e.g. distances based on tree rearrangements such as SPRs.

Subtree Prune and Regrafts
• A “subtree prune and regraft” (SPR) corresponds to a specific rearrangement of a tree.
• For two leaf-labeled trees, dSPR(T1, T2) is the minimum number of SPRs needed to transform T1 into T2.

Homoplasy for two characters
• Theorem: If chi_1 and chi_2 are two characters then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree on which chi_1 is compatible to any leaf-labeled tree on which chi_2 is compatible!
• Informally: Thus we have a whole new interpretation of homoplasy.
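The two-character corollary from earlier in the talk (two characters are compatible if and only if their intersection graph is acyclic) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the function names (blocks, intersection_graph, cycle_rank, pairwise_compatible) are hypothetical, characters are assumed to be dictionaries from taxon to state on the same taxa set, and acyclicity is checked using the standard graph-theory fact that e - v + k (edges minus vertices plus components) is zero exactly when a graph has no cycle.

```python
from itertools import product


def blocks(character):
    """Group taxa by state: the partition of X induced by a character."""
    parts = {}
    for taxon, state in character.items():
        parts.setdefault(state, set()).add(taxon)
    return list(parts.values())


def intersection_graph(chi1, chi2):
    """Vertices are the blocks of both characters; edges join overlapping blocks."""
    b1, b2 = blocks(chi1), blocks(chi2)
    vertices = [(1, i) for i in range(len(b1))] + [(2, j) for j in range(len(b2))]
    edges = [((1, i), (2, j))
             for (i, a), (j, b) in product(enumerate(b1), enumerate(b2))
             if a & b]  # non-empty intersection of the two blocks
    return vertices, edges


def cycle_rank(vertices, edges):
    """Return e - v + k, which is 0 exactly when the graph is acyclic."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    components = len({find(v) for v in vertices})
    return len(edges) - len(vertices) + components


def pairwise_compatible(chi1, chi2):
    """Two characters are compatible iff their intersection graph is acyclic."""
    return cycle_rank(*intersection_graph(chi1, chi2)) == 0


# Hypothetical example characters on the taxa S1..S4 with states A and C.
chi_a = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi_b = {"S1": "A", "S2": "C", "S3": "C", "S4": "C"}
chi_c = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}

print(pairwise_compatible(chi_a, chi_b))  # True: the graph is a path
print(pairwise_compatible(chi_a, chi_c))  # False: the four blocks form a 4-cycle
```

Because blocks of the same character are disjoint, edges only ever join blocks of different characters, so the graph is bipartite, as noted on the slides. A scan over all pairs of sites in an alignment with a function like this is one way to gather the pairwise information that the recombination test in the next section relies on.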
Application - Testing for Recombination
• If recombination has occurred, sites will have different histories.
• Nearby sites will tend to have “greater” genealogical correlation than distant sites.
• Idea: If recombination has occurred, genealogical correlation will be partially reflected by a tendency for pairs of closely linked sites to have less homoplasy than distant sites.

Test for Recombination
• Idea: We would like to distinguish between two possibilities - recurrent mutation and recombination.
• Idea: Use the previous observations to develop a test for recombination.
• H0: A single history describes all sites.
• H0’: Nearby sites share no more compatibility than arbitrary pairs of sites.
• Use a statistic to capture this information and solve analytically for p-values.

Application: Parsimony and supertrees
• Supertree: MRP - parsimony with characters that represent trees.
• What does homoplasy mean in this context?
Courtesy of TREE 12:315-322

Parsimony as a consensus tree
• Recall: If chi_1 and chi_2 are two characters then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree on which chi_1 is compatible to any leaf-labeled tree on which chi_2 is compatible.
• Informally: This can be generalized to show that the maximum parsimony tree for a set of characters minimizes the SPR distance to the sets of trees on which each character is compatible.

Acknowledgements
• Thanks for listening!
• Background and further reading:
• Phylogenetics, Semple and Steel (book, 2003)
• Some results I presented are not in this book - they are from my own work. Please talk to me if you are interested.
• I have many other references - please see me if interested.