An introduction to maximum parsimony and compatibility

Trevor Bruen
PhD Candidate
McGill Centre for Bioinformatics
Overview
•
The point of this talk is to give a sense of how discrete mathematics enters
into phylogenetic and genetic inference.
•
I will illustrate these ideas by describing two approaches in detail, namely
maximum compatibility and maximum parsimony.
•
I will also show how ideas from these two criteria can be used to develop
applications such as bounds and tests for recombination.
•
My goal is to give a basis for further study in this area and
greater insight into these methods.
Outline
•
Introduction to compatibility and parsimony
•
Overview of basic notation/concepts
•
Compatibility
• Compatibility as a graph theory problem
• Compatibility for pairs of characters
• Interpretation of compatibility
•
Parsimony
• Parsimony score with connections to graph theory
• Connections between parsimony and compatibility
• Homoplasy
• Parsimony for pairs of characters
• Connections between SPRs/TBRs and parsimony
• Applications to recombination
• Parsimony as a consensus method
Introduction
•
Maximum parsimony and maximum compatibility are criteria used in phylogenetics,
linguistics and population genetics
• Phylogenetics: the goal is to infer an evolutionary tree
• Linguistics: often the same
• Population genetics: compatibility is used to study recombination
•
For general phylogenetic inference with molecular data, likelihood (probability based)
methods are generally preferred.
•
BUT compatibility and parsimony are computationally tractable.
•
ALSO the mathematics behind parsimony and compatibility is very well developed.
We can show that maximum parsimony and maximum likelihood agree in certain circumstances (Tuffley and Steel
1997). This gives us insight into where to go in terms of research.
Formalism
•
A character is a mapping from a set of
taxa to a set of states.
•
In this case, X={S1,S2,S3,S4}
•
Also, C={A,C}
•
Informally, a character is a “column” in
a multiple sequence alignment
Binary Character / Splits
•
If a character has two states then it induces a
split of the taxa set.
•
Example: Let X be the taxa set
{S1,S2,S3,S4}. Let C be the state set {A,C}.
•
Then {S1,S2} | {S3,S4} is the split induced
by the first character.
•
In general, a character induces a set of
equivalence classes (a partition of the taxa set).
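A minimal Python sketch of the two ideas above: a character read off as a column of an alignment, and the blocks (equivalence classes) it induces. The toy alignment and the function names `character` and `induced_partition` are hypothetical; only the first column follows the example on the slide.

```python
from collections import defaultdict

# Hypothetical toy alignment: one row per taxon, one column per character.
# Column 0 matches the slide's example; column 1 is made up for illustration.
alignment = {"S1": "AA", "S2": "AC", "S3": "CA", "S4": "CC"}


def character(alignment, column):
    """Return the character (a map taxon -> state) given by one column."""
    return {taxon: seq[column] for taxon, seq in alignment.items()}


def induced_partition(chi):
    """Group the taxa by state: one block per state of the character."""
    blocks = defaultdict(set)
    for taxon, state in chi.items():
        blocks[state].add(taxon)
    return {state: frozenset(taxa) for state, taxa in blocks.items()}


chi1 = character(alignment, 0)
# For a two-state character the two blocks form the split {S1,S2} | {S3,S4}.
print(induced_partition(chi1))
```

Representing a character as a plain dictionary keeps the correspondence with the definition (a mapping from taxa to states) explicit.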
Tree and Labeling
•
Informally we would like to be able to
mathematically describe a tree and a
labeling structure.
•
In graph theory a tree T=(V,E) is a
connected graph with no cycles.
•
Informally, we would also like to be able to
add taxa (members of X) to our tree
(actually the leaves).
•
Define a labeling function phi, written $\phi : X \to V(T)$, such that the leaves
of T are labeled by members of X.
X-Trees
•
An X-tree consists of a pair (T, phi),
where phi is a labeling function that
labels the leaves of T.
•
Recall: phi assigns the members of X
to the leaves of T.
Extensions
•
Informally, we have an X-tree
consisting of the pair (T,phi). We also
have a character chi. We need to
relate the character to the tree.
•
Define an extension of the character chi as a
function $\bar{\chi} : V(T) \to C$ that is consistent with chi at the
leaves: $\bar{\chi}(\phi(x)) = \chi(x)$ for all $x \in X$.
•
Informally, an extension provides a
description of how the internal vertices
are labeled.
Quick Summary
•
Summary so far:
• X-trees are trees along with
functions labeling the leaves with
members of X
• A character is a function from X
into a state set C
• An extension is a labeling of the
vertices of T with states of C
Compatibility - Definition
•
A character is compatible with a tree if
and only if there exists an extension of
the character to the tree so that the
subgraphs induced by each of the
states are connected.
•
Example:
• First tree: the character is compatible
with the tree
• Second tree: the character is
incompatible since the two A’s are
disconnected
Compatibility
•
Problem definition: Given a sequence of characters $\mathcal{C} = (\chi_1, \dots, \chi_k)$,
determine whether there exists a tree with which all characters are compatible.
•
Related problem: Given a sequence of characters $\mathcal{C}$,
determine the largest set of characters that are compatible with some tree.
Intersection Graph
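• Suppose we have a sequence of
characters $\mathcal{C} = (\chi_1, \dots, \chi_k)$,
where $\chi_i : X \to C_i$.
•
Then each character induces a
partition of X, i.e. $\pi(\chi_i) = \{\chi_i^{-1}(\alpha) : \alpha \in \chi_i(X)\}$.
•
Create a graph where the vertex set
consists of the pairs $(\chi_i, A)$ with $A \in \pi(\chi_i)$.
•
There is an edge between two vertices
$(\chi_i, A)$ and $(\chi_j, B)$ if and only if
the intersection $A \cap B$ is non-empty.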
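The construction can be sketched in Python as follows. This is an illustrative sketch, not code from the talk: characters are assumed to be dictionaries from taxa to states, a vertex is written as a pair (character index, state), and the function name `partition_intersection_graph` is hypothetical.

```python
from itertools import combinations


def partition_intersection_graph(characters):
    """Build the intersection graph of a list of characters.

    Each character is a dict taxon -> state. A vertex (i, state) stands for
    the block of character i whose taxa carry that state. Two vertices are
    joined when their blocks share at least one taxon.
    """
    blocks = {}
    for i, chi in enumerate(characters):
        for taxon, state in chi.items():
            blocks.setdefault((i, state), set()).add(taxon)
    vertices = sorted(blocks)
    edges = {
        frozenset((u, v))
        for u, v in combinations(vertices, 2)
        if blocks[u] & blocks[v]  # non-empty intersection
    }
    return vertices, edges


# Two hypothetical binary characters on four taxa.
chi1 = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi2 = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
vertices, edges = partition_intersection_graph([chi1, chi2])
print(len(vertices), len(edges))
```

Blocks of the same character are disjoint, so the intersection test never joins two vertices of the same character.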
Intersection Graph
•
Whether the sequence of
characters $\mathcal{C}$
is compatible can be
determined directly from the
intersection graph.
•
First we need to define two concepts:
a chordal graph and a restricted
chordal completion of the intersection
graph.
Chordal Graphs
•
A graph G=(V,E) is a chordal graph if
every cycle with at least four vertices
contains a chord (an edge connecting
two non-consecutive vertices of the cycle).
•
A chordalization of a graph G=(V,E) is a graph
G’=(V,E’) where $E \subseteq E'$, such
that G’ is chordal.
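One simple way to test chordality, sketched below in Python, is to delete simplicial vertices (vertices whose neighbours are pairwise adjacent) one at a time; a graph is chordal exactly when every vertex can be deleted this way. This elimination test is a standard technique that is not mentioned on the slides, and the function name `is_chordal` is hypothetical.

```python
def is_chordal(vertices, edges):
    """Return True if the graph is chordal.

    Repeatedly delete a simplicial vertex (one whose remaining neighbours
    are pairwise adjacent). The graph is chordal exactly when this process
    deletes every vertex.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    while remaining:
        for v in list(remaining):
            nbrs = adj[v] & remaining
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                remaining.remove(v)
                break
        else:
            return False  # no simplicial vertex is left
    return True


four_cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_chordal([1, 2, 3, 4], four_cycle))             # cycle without a chord
print(is_chordal([1, 2, 3, 4], four_cycle + [(1, 3)]))  # chordalization: chord added
```

The second call shows a chordalization in the sense above: the edge set grows, the vertex set stays the same, and the result is chordal.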
Restricted Chordal Completions
•
Imagine the vertices of our graph
G=(V,E) are colored. Then a restricted
chordalization of G is a graph
G’=(V,E’), where G’ is chordal and all
edges of G’ connect vertices of
different colors.
Restricted chordal completions
•
A restricted chordal completion of the
intersection graph is a chordalization
where there is no edge between
vertices that share the same character.
•
In this case, the “colors” correspond to
characters
Main Theorem for Compatibility
•
Let $\mathcal{C} = (\chi_1, \dots, \chi_k)$ be a
collection of characters. Then $\mathcal{C}$ is
compatible if and only if there is a
restricted chordal completion of the
intersection graph.
Pairs of Characters
•
A simple corollary of main theorem arises
when we restrict our attention to two
characters.
•
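Corollary: Two characters $\chi_1$ and $\chi_2$
are compatible if and only if the intersection
graph G of the two characters is acyclic.
•
Proof: (backward direction) If G is
acyclic then it is chordal, so G is its own
restricted chordal completion and the
characters are compatible.
(forward direction) Suppose G
contains a cycle. G is bipartite (every edge
joins a block of one character to a block of
the other), so this cycle has at least four
vertices and any chordal completion of G
must contain a three cycle. A three cycle
needs two vertices from the same character,
so no restricted completion of G can
contain one. Hence compatible characters
must have an acyclic G.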
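The corollary turns pairwise compatibility into a cycle check, which can be sketched in Python with a union-find structure. This is an illustration under assumptions: both characters are dictionaries over the same taxa, and the function name `pairwise_compatible` is hypothetical.

```python
def pairwise_compatible(chi1, chi2):
    """Return True if the intersection graph of two characters is acyclic.

    Both characters are dicts taxon -> state over the same set of taxa.
    """
    # One edge per pair of states that co-occur in at least one taxon.
    edges = {(("chi1", chi1[x]), ("chi2", chi2[x])) for x in chi1}
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge closes a cycle
        parent[root_u] = root_v
    return True


chi1 = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi2 = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}  # all four state pairs occur
chi3 = {"S1": "A", "S2": "A", "S3": "A", "S4": "C"}
print(pairwise_compatible(chi1, chi2), pairwise_compatible(chi1, chi3))
```

For binary characters this reduces to the familiar observation that a pair is incompatible exactly when all four combinations of states occur.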
Interpretation
•
Recall: a character is
compatible with an X-tree if and only if
there exists an extension of the
character to the tree so that the
subgraphs induced by each of the
states are connected.
•
Informally speaking this is a very strict
condition. This corresponds to an “all
or nothing” condition - either a
character is compatible with a tree or it
isn’t. Relaxing this condition is the
subject of the next section.
Parsimony
•
Informally: given a leaf-labeled tree
and a character, how can we define
the fit of the character to the tree?
•
Consider a character $\chi$ along with
an extension $\bar{\chi}$
to a leaf-labeled tree T.
Then the length of the extension is the
number of edges $\{u, v\}$ of T where $\bar{\chi}(u) \neq \bar{\chi}(v)$.
•
Define the parsimony score of a
character on a tree as the length of a
minimal extension of the character to
the tree. Denote this value by $\ell(\chi, T)$.
Parsimony
• Then the parsimony score for a set of characters $\mathcal{C} = (\chi_1, \dots, \chi_k)$
on a tree T
is defined as: $\ell(\mathcal{C}, T) = \sum_{i=1}^{k} \ell(\chi_i, T)$
•
The tree that minimizes this score is referred to as the maximum parsimony
tree.
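For a fixed tree, the score of each character can be computed with Fitch's algorithm, a standard method that the slides do not name. The Python sketch below assumes a binary tree, rooted at an arbitrary point and written as nested pairs with taxon names at the leaves; the function names are hypothetical.

```python
def fitch_score(tree, chi):
    """Parsimony score of one character on a rooted binary tree.

    `tree` is a nested 2-tuple with taxon names at the leaves and `chi` is a
    dict taxon -> state. Each internal vertex gets the intersection of its
    children's state sets when that is non-empty, otherwise their union at
    the cost of one change.
    """
    def visit(node):
        if not isinstance(node, tuple):  # leaf
            return {chi[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, characters):
    """Sum of the per-character scores on the same tree."""
    return sum(fitch_score(tree, chi) for chi in characters)


chi1 = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi2 = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
tree1 = (("S1", "S2"), ("S3", "S4"))
tree2 = (("S1", "S3"), ("S2", "S4"))
print(fitch_score(tree1, chi1), fitch_score(tree2, chi1))
print(parsimony_score(tree1, [chi1, chi2]))
```

Finding the maximum parsimony tree would mean minimizing `parsimony_score` over all trees, which this sketch does not attempt.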
Parsimony and graph theory
•
A minimal cut set for a leaf-labeled tree T=(V,E) and a character $\chi$
is a minimum-size set
of edges whose removal ensures that if $\chi(x) \neq \chi(y)$
then x and y are in different
components.
•
Claim: There is a bijection between the set of minimal cut sets and minimal
extensions. So the cardinality of a minimal cut set is equal to the parsimony score.
Parsimony and Graph Theory
•
Recall Menger’s Theorem (1927): Let G=(V,E) be a graph with V1 and V2 as two
disjoint subsets of V. Then the minimum number of edges whose removal from G
leaves vertices of V1 and V2 in different components is equal to the maximum number
of edge disjoint paths between V1 and V2.
•
Corollary: For a binary character, the maximum number of edge-disjoint paths
between the two blocks of the induced split equals the parsimony score.
Compatibility and parsimony
•
Recall: let $\mathcal{C}$ be
a collection of characters. Then $\mathcal{C}$
is compatible if and only if there is a
restricted chordal completion of the
intersection graph.
•
Question: How can we characterize
parsimony with respect to an
intersection graph?
Compatibility Graph
•
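Recall: Each character $\chi$ induces a
partition of X, i.e. $\pi(\chi) = \{\chi^{-1}(\alpha) : \alpha \in \chi(X)\}$.
•
A block of a character $\chi$
is a maximal subset of taxa on which $\chi$ is
constant.
•
Thus we may identify the blocks of the characters
with the vertices of the intersection
graph.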
Character Refinement
•
A character $\chi'$ refines another
character $\chi$ if
$\chi'(x) = \chi'(y)$ implies $\chi(x) = \chi(y)$ for all x, y in X.
•
Thus characters that refine other
characters correspond to refinements
of the partition.
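Taking refinement in the usual partition sense (every block of the finer character lies inside a single block of the coarser one), the check can be sketched in Python as below. The function name `refines` and the example characters are hypothetical.

```python
def refines(fine, coarse):
    """Return True if `fine` refines `coarse`.

    Both are dicts taxon -> state over the same taxa. The test is that
    fine(x) == fine(y) always implies coarse(x) == coarse(y), i.e. every
    block of `fine` sits inside one block of `coarse`.
    """
    image = {}  # state of `fine` -> the single state of `coarse` it maps into
    for taxon, state in fine.items():
        if image.setdefault(state, coarse[taxon]) != coarse[taxon]:
            return False
    return True


coarse = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
fine = {"S1": "a", "S2": "b", "S3": "c", "S4": "c"}  # splits the block {S1, S2}
print(refines(fine, coarse), refines(coarse, fine))
```

Every character refines itself, so the relation is reflexive as expected of partition refinement.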
Compatibility and Parsimony
•
Recall: Let $\mathcal{C}$ be a collection of characters. Then $\mathcal{C}$ is
compatible if and only if there is a restricted chordal completion of the intersection
graph.
•
Main:
Special Case: Two characters
•
Recall: Two characters are compatible
if and only if the intersection graph G
of the two characters is acyclic.
•
Using the previous theorem we can
show that the parsimony score for two
characters $\chi_1$ and $\chi_2$ corresponds to:
$\ell(\chi_1, \chi_2) = |E(G)| + k - 2$,
where $|E(G)|$ is the number of edges and k is the number of components in
the graph.
•
Note: This score is the
maximum parsimony score, that is, the minimum over all
trees.
Homoplasy
•
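Recall: The parsimony score of a
character on a tree, $\ell(\chi, T)$,
corresponds to the minimum number of
changes of a character on a tree.
•
Informally: What is an intuitive way to
think about the parsimony score?
•
Define the homoplasy of a character on
a tree as $h(\chi, T) = \ell(\chi, T) - (|\chi(X)| - 1)$,
where $|\chi(X)|$ is the number of states that the character takes.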
Homoplasy
•
Note that $h(\chi, T) \geq 0$, with equality
if and only if $\chi$ is convex on T
(that is, compatible with T).
•
Informally: Homoplasy corresponds to
the number of “extra” mutations of the
character on the tree. These “extra”
mutations correspond to recurrent
mutations.
•
Informally: Thus a character is not
compatible with a tree iff it cannot be
placed on the tree without “extra”
mutations.
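A small worked example may help. It takes the homoplasy to be the parsimony score minus the number of states plus one (the usual definition, assumed here), and uses a hypothetical two-state character on four taxa with two hypothetical trees written in nested-parenthesis form.

```latex
\begin{align*}
\chi &: S_1 \mapsto A,\quad S_2 \mapsto A,\quad S_3 \mapsto C,\quad S_4 \mapsto C,
  \qquad |\chi(X)| = 2 \\
T_1 &= ((S_1,S_2),(S_3,S_4)): \quad \ell(\chi, T_1) = 1,
  \quad h(\chi, T_1) = 1 - (2 - 1) = 0 \\
T_2 &= ((S_1,S_3),(S_2,S_4)): \quad \ell(\chi, T_2) = 2,
  \quad h(\chi, T_2) = 2 - (2 - 1) = 1
\end{align*}
```

On $T_1$ the character needs a single change and is convex; on $T_2$ it needs one “extra” change, so it is not compatible with $T_2$.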
Homoplasy For Two Characters
•
Recall: The parsimony score for a pair
of characters can be found directly
from the bipartite intersection graph.
•
Recall: This score corresponds to an
optimum over all trees.
•
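Thus for two characters, we can define
a pairwise homoplasy score as
$h(\chi_1, \chi_2) = \ell(\chi_1, \chi_2) - (|\chi_1(X)| - 1) - (|\chi_2(X)| - 1)$,
where $\ell(\chi_1, \chi_2)$ is the parsimony score of the pair, minimized over all trees.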
•
Recall: Up to now homoplasy refers to
“extra” mutations on a tree.
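One way to compute such a pairwise score directly from the bipartite intersection graph is sketched below in Python. It assumes, as a labeled assumption rather than something stated in the text, that the score equals the cyclomatic number of the graph (edges minus vertices plus components), which is zero exactly when the graph is acyclic. The function name `pairwise_homoplasy` is hypothetical.

```python
def pairwise_homoplasy(chi1, chi2):
    """Cyclomatic number |E| - |V| + k of the bipartite intersection graph.

    Assumption: this is taken as the pairwise homoplasy score. Both
    characters are dicts taxon -> state over the same taxa.
    """
    vertices = {("chi1", s) for s in chi1.values()}
    vertices |= {("chi2", s) for s in chi2.values()}
    # One edge per pair of states that co-occur in at least one taxon.
    edges = {(("chi1", chi1[x]), ("chi2", chi2[x])) for x in chi1}

    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Count connected components with a depth-first search.
    seen, components = set(), 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v] - seen)
    return len(edges) - len(vertices) + components


chi1 = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi2 = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
chi3 = {"S1": "A", "S2": "A", "S3": "A", "S4": "C"}
print(pairwise_homoplasy(chi1, chi2), pairwise_homoplasy(chi1, chi3))
```

Under this assumption the score is computed without searching over trees, which is what makes pairwise statistics cheap to evaluate across many pairs of sites.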
A second look at homoplasy
•
Example: Two characters with a pairwise
homoplasy score equal to one.
•
Informally: We have seen that the
homoplasy corresponds to the number of
“extra” mutations on a tree.
•
But in certain situations, this is biologically
implausible. The state 1 may correspond to
a mutation that has arisen only once. In this
case, the fact that the pair of characters is
incompatible can be explained by a
recombination event.
•
This will be defined more precisely later.
A quick aside - tree distances.
•
Differences between leaf-labeled trees
can be measured using various metrics, e.g. the subtree prune and regraft (SPR) distance
•
A “subtree prune and regraft” (SPR)
is a specific rearrangement of a tree: a subtree is cut off and reattached elsewhere.
•
For two leaf-labeled trees, dSPR(T1, T2)
is the minimum number of SPRs needed to turn T1 into T2
Homoplasy for two characters
•
Theorem: If $\chi_1$ and $\chi_2$ are two
characters then $h(\chi_1, \chi_2)$
corresponds to the minimum number
of SPRs from any leaf-labeled tree with
which $\chi_1$ is compatible to any leaf-labeled
tree with which $\chi_2$ is
compatible!
•
Informally: Thus we have a whole new
interpretation of homoplasy.
Application - Testing
for Recombination
•
If recombination has occurred sites will
have different histories
• Nearby sites will tend to have
“greater” genealogical correlation
than distant sites
•
Idea: If recombination has occurred,
genealogical correlation will be
partially reflected by a tendency for
pairs of closely linked sites to have
less homoplasy than distant sites
Test for Recombination
•
Idea: We would like to distinguish
between two possibilities - recurrent
mutation and recombination.
•
Idea: Use previous observations to
develop test for recombination.
• H0: A single history describes all
sites.
• H0’: Nearby sites share no more
compatibility than arbitrary pairs of
sites
•
Use a test
statistic to capture this
information and solve analytically for
p-values
Application: Parsimony and
supertrees
•
Supertree: MRP - parsimony with
characters that represent trees.
•
What does homoplasy mean in this
context?
Courtesy of TREE 12:315-322
Parsimony as a consensus tree
•
Recall: If $\chi_1$ and $\chi_2$ are two
characters then $h(\chi_1, \chi_2)$
corresponds to the minimum number
of SPRs from any leaf-labeled tree with
which $\chi_1$ is compatible to any leaf-labeled
tree with which $\chi_2$ is
compatible.
•
Informally: This can be generalized to
show that the maximum parsimony
tree for a set of characters $\mathcal{C}$
minimizes the SPR distance to the
sets of trees with which each character
is compatible…
Acknowledgements
•
Thanks for listening!
•
Background and further reading:
•
Phylogenetics, Semple and Steel
(book 2003)
•
Some results I presented are not in
this book; they come from my own
work. Please talk to me if you
are interested.
•
I have many other references; please
see me if interested.