Supertrees: Algorithms and Databases

Roderic Page
University of Glasgow
r.page@bio.gla.ac.uk
DIMACS Working Group Meeting on Mathematical
and Computational Aspects Related to the Study of
The Tree of Life
What do we mean by the “Tree of Life”?
Our perception of what the tree is may affect what we view
as being the “interesting” problems
[Figure: two alternative views, separated by "or", with the labels: tree algorithms, models, genomics, supertrees, datatypes, lateral gene transfer, databases, taxonomy]
Topics
• Supertrees (MinCut)
• Phylogenetic databases
Tree terminology
[Figure: rooted tree on leaves a, b, c, d, annotated with the terms leaf, edge, internal node, root, and cluster; the clusters shown are {a,b}, {a,b,c}, and {a,b,c,d}]
Nestings and triplets
[Figure: rooted tree on leaves a, b, c, d]
Nestings
{a,b} <T {a,b,c,d}
{b,c} <T {a,b,c,d}
Triplets
(bc)d
bc|d
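To make the cluster and triplet notation concrete, here is a minimal Python sketch. It assumes a rooted tree is written as nested tuples with string leaves; that encoding, the function names, and the example tree are illustrative choices and do not come from the slides. A triplet (xy)z is reported whenever some cluster contains x and y but not z.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def clusters(tree):
    """Return the clusters (leaf sets of internal nodes), root included."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clusters(child))
    return found

def triplets(tree):
    """Return rooted triplets as (frozenset({x, y}), z), meaning (xy)z."""
    result = set()
    all_leaves = leaves(tree)
    for cluster in clusters(tree):
        for x, y in combinations(sorted(cluster), 2):
            for z in all_leaves - cluster:
                result.add((frozenset((x, y)), z))
    return result

# Hypothetical example tree: (((a,b),c),d)
example_tree = ((("a", "b"), "c"), "d")
example_triplets = triplets(example_tree)
for pair, z in sorted((sorted(p), z) for p, z in example_triplets):
    print("(%s%s)%s" % (pair[0], pair[1], z))
```

For this example tree the triplet (bc)d is among those listed, since the cluster {a,b,c} contains b and c but not d.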
Supertree
a
a
b
c
b
c
c
d
=
+
T1
b
T2
supertree
d
Some desirable properties of a
supertree method
(Steel et al., 2000)
• The supertree can be computed in
polynomial time
• A grouping in one or more trees that is
not contradicted by any other tree
occurs in the supertree
Aho et al.’s algorithm (OneTree)
Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. 1981. Inferring a tree
from lowest common ancestors with an application to the optimization of
relational expressions. SIAM J. Comput. 10: 405-421.
Input: set of rooted trees
1. If set is compatible (i.e., will agree on a tree),
output that tree.
2. If set is not compatible, stop!
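The two steps above can be sketched in Python. This is a simplified illustration in the spirit of Aho et al.'s algorithm, not the published pseudocode: it assumes the input trees have already been reduced to rooted triplets written as ((x, y), z) for (xy)z, and the function name and example triplets are hypothetical. At each level it joins x and y for every triplet (xy)z, recurses on the connected components, and gives up (returns None) if there is only one component.

```python
def build(leaf_set, triplets):
    """Return a nested-tuple tree displaying every triplet, or None if the
    triplets are incompatible. Each triplet is ((x, y), z), read as (xy)z."""
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    if len(leaf_set) == 2:
        return tuple(sorted(leaf_set))

    # Keep only triplets whose three leaves are all in the current set.
    relevant = [((x, y), z) for (x, y), z in triplets
                if {x, y, z} <= leaf_set]

    # Union-find: join x and y for every relevant triplet (xy)z.
    parent = {leaf: leaf for leaf in leaf_set}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (x, y), _ in relevant:
        parent[find(x)] = find(y)

    components = {}
    for leaf in leaf_set:
        components.setdefault(find(leaf), set()).add(leaf)
    if len(components) == 1:
        return None  # graph is connected: the set is not compatible, stop

    subtrees = []
    for component in sorted(components.values(), key=min):
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

# Hypothetical inputs: T1 = ((a,b),c) gives (ab)c, T2 = ((b,c),d) gives (bc)d.
compatible = build("abcd", [(("a", "b"), "c"), (("b", "c"), "d")])
# Conflicting triplets (ab)c and (bc)a cannot be displayed by one tree.
incompatible = build("abc", [(("a", "b"), "c"), (("b", "c"), "a")])
print(compatible, incompatible)
```

With these hypothetical inputs the leaf set is split into {a, b, c} and {d}, then {a, b} and {c}, so the result is (((a,b),c),d); the conflicting pair returns None.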
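[Figure: Aho et al.'s OneTree algorithm applied to T1 (leaves a, b, c) and T2 (leaves b, c, d); the leaf set {a, b, c, d} is split into successively smaller groups ({a, b, c}, then {a, b}) to give the supertree]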
Mincut supertrees
Semple, C., and Steel, M. 2000. A supertree method for
rooted trees. Discrete Appl. Math. 105: 147-158.
• Modifies OneTree by cutting graph
• Requires rooted trees (no analogue of
OneTree for unrooted trees)
• Recursive
• Polynomial time
[Figure: input trees T1 (leaves a, b, c, d, e) and T2 (leaves a, b, c, d), and the graph S{T1,T2} on vertices a, b, c, d, e; from Semple and Steel (2000)]
Collapsing the graph
(Semple and Steel mincut algorithm)
[Figure: the graph S{T1,T2} with edge weights; the edge between a and b has the maximum weight (2), and collapsing it into a single vertex "a,b" gives S{T1,T2}/E^max{T1,T2}]
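The graph construction and the collapsing step can be sketched as follows. This is a minimal illustration, assuming unit tree weights, that two leaves are joined when they share a proper cluster (a cluster other than the whole leaf set) in at least one tree, that an edge's weight is the number of such trees, and that parallel edges created by collapsing are merged by summing their weights. The two trees are hypothetical stand-ins chosen so that only the edge between a and b has weight 2; they are not necessarily the trees in the figure.

```python
from itertools import combinations

def leaves(tree):
    """Leaf set of a nested-tuple tree whose leaves are strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def proper_clusters(tree):
    """Clusters of all internal nodes except the root."""
    found = []
    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)
    visit(tree, True)
    return found

def build_graph(trees):
    """Weight of edge {a, b} = number of trees with a, b in a proper cluster."""
    weights = {}
    for tree in trees:
        pairs = set()
        for cluster in proper_clusters(tree):
            pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
        for pair in pairs:
            weights[pair] = weights.get(pair, 0) + 1
    return weights

def collapse_max_edges(vertices, weights, k):
    """Merge the endpoints of every edge of weight k; sum parallel edges."""
    label = {v: frozenset([v]) for v in vertices}
    for pair, w in weights.items():
        if w == k:
            a, b = pair
            merged = label[a] | label[b]
            for v in merged:
                label[v] = merged
    collapsed = {}
    for pair, w in weights.items():
        a, b = pair
        if label[a] != label[b]:
            edge = frozenset([label[a], label[b]])
            collapsed[edge] = collapsed.get(edge, 0) + w
    return set(label.values()), collapsed

# Hypothetical input trees (not taken from the slides).
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", "b"), ("c", "d"))
trees = [T1, T2]
weights = build_graph(trees)
all_leaves = set().union(*(leaves(t) for t in trees))
vertices, collapsed = collapse_max_edges(all_leaves, weights, k=len(trees))
print(sorted(sorted(v) for v in vertices))
```

Here a and b are merged into one vertex because they share a proper cluster in both trees, which is the sense in which the edge "has maximum weight".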
Cut the graph to get supertree
[Figure: cutting the graph S{T1,T2}/E^max{T1,T2} (vertices "a,b", c, d, e) separates it into the components that define the supertree on leaves a, b, c, d, e]
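The cutting step can be illustrated with a brute-force sketch that is only practical for very small graphs: it enumerates every bipartition of the vertices, keeps the minimum total crossing weight, and collects every edge that lies in at least one minimum cut. This is not the algorithm used in the implementation described in these slides; the function names, the vertex labels, and the example weights are assumptions made for illustration.

```python
from itertools import combinations

def min_cut_edges(vertices, weights):
    """Brute force: return (minimum cut weight, edges in some minimum cut).

    `weights` maps frozenset({u, v}) to the weight of edge u-v.
    """
    vertices = sorted(vertices)
    first, rest = vertices[0], vertices[1:]
    best, best_edges = None, set()
    # Enumerate every bipartition; `first` always stays on the left side,
    # and the right side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            crossing = {e for e in weights if len(e & left) == 1}
            total = sum(weights[e] for e in crossing)
            if best is None or total < best:
                best, best_edges = total, set(crossing)
            elif total == best:
                best_edges |= crossing
    return best, best_edges

def components(vertices, edges):
    """Connected components of the graph with the given edges."""
    remaining = set(vertices)
    result = []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            v = stack.pop()
            for e in edges:
                if v in e:
                    (other,) = e - {v}
                    if other in remaining:
                        remaining.discard(other)
                        component.add(other)
                        stack.append(other)
        result.append(component)
    return result

# Hypothetical collapsed graph: "ab" stands for the merged vertex a,b.
graph_vertices = ["ab", "c", "d", "e"]
graph_weights = {
    frozenset(("ab", "c")): 2,
    frozenset(("c", "d")): 1,
    frozenset(("d", "e")): 1,
}
cut_weight, cut = min_cut_edges(graph_vertices, graph_weights)
kept = [e for e in graph_weights if e not in cut]
parts = components(graph_vertices, kept)
print(cut_weight, sorted(sorted(p) for p in parts))
```

Each resulting component becomes a subtree below the current node, and the procedure is then applied recursively to the input trees restricted to each component.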
My mincut supertree implementation
darwin.zoology.gla.ac.uk/~rpage/supertree
• Written in C++
• Uses GTL (Graph Template Library) to
handle graphs (formerly a free alternative to
LEDA)
• Finds all mincuts of a graph faster than
Semple and Steel’s algorithm
A counter example:
two input trees...
[Figure: two input trees that share the leaves a, b, and c; the remaining leaves are x1, x2, x3 and y1, y2, y3, y4]
Mincut gives this (strange) result
• Disputed relationships
among a, b, and c are
resolved
• x1, x2, and x3
collapsed into
polytomy
[Figure: mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
Problem:
Cuts depend on connectivity
(in this example it is a function of tree size)
[Figure: the graph S{T1,T2} for the two input trees, on vertices a, b, c, x1, x2, x3, y1, y2, y3, y4]
So, mincut doesn’t work
• But, Semple and Steel said it did
• My program seems to work
• Argh!!! What is happening….?
What mincut does…
…and does not do
• Mincut supertree is guaranteed to include
any nesting which occurs in all input trees
• Makes no claims about nestings which
occur in only some of the trees
• “Does exactly what it says on the tin™”
Modifying mincut supertree
• Can we incorporate more of the information
in the input trees?
• Three categories of information
• Unanimous (all trees have that grouping)
• Contradicted (trees explicitly disagree)
• Uncontradicted (some trees have
information that no other tree disagrees
with)
Uncontradicted information
Assume we have k input trees
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
c - n = 0: uncontradicted (if c = k then unanimous)
c - n > 0: contradicted
Uncontradicted information
Assume we have k input trees
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
f = number of trees in which a and b are in a fan
c - n - f = 0: uncontradicted (if c = k then unanimous)
c - n - f > 0: contradicted
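The counting rule above can be sketched in Python. The sketch makes two assumptions that the slide does not spell out: "nested" is taken to mean that a and b share a proper cluster (both lie inside the same subtree below the root), and "in a fan" is taken to mean that a and b are both direct children of the root. Trees are nested tuples with string leaves, and the extra "no information" label for pairs that never co-occur is an addition for the sketch.

```python
def leaves(tree):
    """Set of leaf labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def pair_counts(trees, a, b):
    """Return (c, n, f) for the leaf pair a, b."""
    c = n = f = 0
    for tree in trees:
        if not {a, b} <= leaves(tree):
            continue
        c += 1  # a and b co-occur in this tree
        subtrees = [child for child in tree if not isinstance(child, str)]
        if any({a, b} <= leaves(child) for child in subtrees):
            n += 1  # a and b share a proper cluster (assumed meaning of nested)
        elif a in tree and b in tree:
            f += 1  # both are direct children of the root (assumed meaning of fan)
    return c, n, f

def classify(trees, a, b):
    """Apply the rule c - n - f = 0 / > 0 from the slide."""
    k = len(trees)
    c, n, f = pair_counts(trees, a, b)
    if c == 0:
        return "no information"  # label added for this sketch
    if c - n - f > 0:
        return "contradicted"
    return "unanimous" if c == k else "uncontradicted"

# Hypothetical input trees.
trees = [
    (("a", "b"), ("c", "d")),
    (("a", "b"), "c"),
    ("a", ("b", "c")),
]
for pair in [("a", "b"), ("c", "d")]:
    print(pair, pair_counts(trees, *pair), classify(trees, *pair))
```

In this example a and b co-occur in all three trees but are nested in only two, so the pair is contradicted; c and d co-occur in a single tree that nests them, so the pair is uncontradicted but not unanimous.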
Classifying edges
[Figure: edges of the graph S{T1,T2} for the counter example, drawn in three classes:]
Uncontradicted
Uncontradicted but adjacent to contradicted
Contradicted
Modified mincut
• Species a, b, and c
form a polytomy
• x1, x2, and x3
resolved as per the
input tree
[Figure: modified mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
If no tree contradicts an item of information, is that
information always in the supertree?
[Figure: four input trees on subsets of the leaves 1 to 5, labelled with the triplets they display: (12)5, (23)5, (34)1, and (45)1]
No!
Steel, Dress, & Böcker 2000
• The four trees display
(12)5, (23)5, (34)1,
and (45)1
• No tree displays (IK)J
or (JK)I for any (IJ)K
above
• Triplets are
uncontradicted, but
cannot form a tree
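The example can be checked mechanically. The sketch below encodes the four triplets as ((x, y), z) for (xy)z, confirms that no triplet in the set is contradicted by another on the same three leaves, and then applies the first step of an Aho-style construction: join x and y for every triplet (xy)z and count connected components. A single component on more than two leaves means no tree can display all the triplets. The helper names are illustrative.

```python
def is_contradicted(triplet, others):
    """True if another triplet groups the same three leaves differently."""
    (x, y), z = triplet
    return any({p, q, r} == {x, y, z} and {p, q} != {x, y}
               for (p, q), r in others)

def graph_components(leaf_set, triplets):
    """Components of the graph with an edge x-y for each triplet (xy)z."""
    adjacency = {leaf: set() for leaf in leaf_set}
    for (x, y), _ in triplets:
        adjacency[x].add(y)
        adjacency[y].add(x)
    unseen, result = set(leaf_set), []
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour in unseen:
                    unseen.discard(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        result.append(component)
    return result

# The four triplets from the slide: (12)5, (23)5, (34)1, (45)1.
slide_triplets = [((1, 2), 5), ((2, 3), 5), ((3, 4), 1), ((4, 5), 1)]
any_contradicted = any(is_contradicted(t, slide_triplets) for t in slide_triplets)
parts = graph_components({1, 2, 3, 4, 5}, slide_triplets)
print(any_contradicted, parts)
```

The edges 1-2, 2-3, 3-4, and 4-5 link all five leaves into one component, so the construction cannot split the leaf set even though every triplet is individually uncontradicted.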
Future directions for supertrees
• Improve handling of uncontradicted
information
• Add support for constraints
• Visualising very big trees
• Better integration into phylogeny
databases (www.treebase.org)
darwin.zoology.gla.ac.uk/~rpage/supertree
Supertree Challenge
(proposed by Mike Sanderson mjsanderson@ucdavis.edu)
The TreeBASE database currently contains over 1000
phylogenies with over 11,000 taxa among them. Many of these
trees share taxa with each other and are therefore candidates for
the construction of composite phylogenies, or "supertrees", by
various algorithms. A challenging problem is the construction of
the largest and "best" supertree possible from this database.
"Largest" and "best" may represent conflicting goals, however,
because resolution of a supertree can be easily diminished by
addition of "inappropriate" trees or taxa.
It’s a scandal
• We cannot answer even the most basic question:
“what is the phylogeny for group x?”
• GenBank is currently the best phylogenetic
database (!)
• Can't even say how many species are in a given
group
• Little idea of who is doing what
Tree of Life
tolweb.org
• Provides text and
images
• Relies on extensive
manual effort (e.g.,
writing text)
• Can’t do any
computations with it
• Limited research value
TreeBASE
www.treebase.org
• Relational database
• Query by author,
taxon, study number
• Compute supertrees
• Submit NEXUS data
files
TreeBASE and mincut supertrees
• User selects two or more
trees
• Clicks on button
and script on
darwin.zoology.gla.ac.uk
is run to create supertree
• Can view as PS, PDF,
treefile, or in Java applet
(ATV)
What’s wrong with TreeBASE?
• No consistency of taxon names
• (e.g., Human, Homo sapiens,
Homo sapiens X54666-1)
• No consistency of data names (e.g., gene
names, morphological characters, etc.)
The same organism may have multiple
names
www.all-species.org
“The ALL Species Foundation is a non-profit
organization dedicated to the complete inventory of
all species of life on Earth within the next 25 years - a
human generation.”
Press Release: November 13, 2002
Starting December 1, the ALL Species Foundation
will close its San Francisco office because of a lack
of funding for the Foundation.
The first challenge
• We need a taxonomic name server that can
resolve the name of any organism
• This server needs to reconcile multiple
classifications (e.g., GenBank, ITIS, etc.)
• Must handle at least 1 million names,
perhaps 100 million
Second Challenge
• How do we query trees?
• Trees can be classifications or phylogenies
SQL Queries on Trees
• Oracle SQL Transitive Closure Query
(recursion)
• Nested queries
• Node path queries
1. All ancestors of node A
2. Least Common Ancestor (LCA) of A and B
3. Spanning clade of A and B
4. Path length between A and B (5 in the example figure)
Node paths
[Figure: tree with each node labelled by its path: /1, /2, /1/1, /1/2, /1/1/1, /1/1/2, /1/2/1, /1/2/2, /1/1/1/1, /1/1/1/2]
Node paths - selecting subtree
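[Figure: tree with each node labelled by its path: /1, /2, /1/1, /1/2, /1/1/1, /1/1/2, /1/2/1, /1/2/2, /1/1/1/1, /1/1/1/2]
SELECT node
FROM nodes  -- table name assumed; the slide omits the FROM clause
WHERE (path LIKE '/1/1/%')
AND (path < '/1/10/%');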
Node paths - selecting subtree
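[Figure: tree with each node labelled by its path: /1, /2, /1/1, /1/2, /1/1/1, /1/1/2, /1/2/1, /1/2/2, /1/1/1/1, /1/1/1/2]
SELECT node
FROM nodes  -- table name assumed; the slide omits the FROM clause
WHERE (path LIKE '/1/1/%')
AND (path < '/1/10/%')
AND (num_children = 0);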
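The two subtree queries can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name `nodes`, its two columns, and the way `num_children` is filled in are hypothetical, and the node paths are the ten shown in the figure. The WHERE clauses follow the slide's queries, with straight quotes and `= 0` in place of `IS 0`.

```python
import sqlite3

# Node paths from the figure.
paths = ["/1", "/2", "/1/1", "/1/2", "/1/1/1", "/1/1/2",
         "/1/2/1", "/1/2/2", "/1/1/1/1", "/1/1/1/2"]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per node, keyed by its path.
conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, num_children INTEGER)")
for path in paths:
    # A child is any node whose path minus its last component equals `path`.
    num_children = sum(1 for other in paths if other.rsplit("/", 1)[0] == path)
    conn.execute("INSERT INTO nodes VALUES (?, ?)", (path, num_children))

# All nodes below /1/1.
subtree = [row[0] for row in conn.execute(
    "SELECT path FROM nodes "
    "WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%') "
    "ORDER BY path")]

# Only the leaves below /1/1.
subtree_leaves = [row[0] for row in conn.execute(
    "SELECT path FROM nodes "
    "WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%') "
    "AND (num_children = 0) "
    "ORDER BY path")]

print(subtree)
print(subtree_leaves)
conn.close()
```

Note that the pattern '/1/1/%' matches descendants of /1/1 but not the node /1/1 itself; adding `OR path = '/1/1'` would include the subtree root.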
Node paths - LCA
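[Figure: tree with each node labelled by its path: /1, /2, /1/1, /1/2, /1/1/1, /1/1/2, /1/2/1, /1/2/2, /1/1/1/1, /1/1/1/2]
The LCA of two nodes is the longest common prefix of their paths (the common substring starting from the left)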
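The four tree queries listed earlier (ancestors, LCA, spanning clade, path length) can all be answered from node paths alone. The sketch below is illustrative and the function names are hypothetical. It compares paths component by component rather than character by character, so that paths such as /1/1 and /1/10 are not mistaken for sharing the prefix /1/1; when two nodes share no leading component the LCA is reported as "/", standing for an implicit root.

```python
def parts(path):
    """Split '/1/1/2' into ['1', '1', '2']."""
    return path.strip("/").split("/")

def ancestors(path):
    """All proper ancestors of a node, from the top down."""
    p = parts(path)
    return ["/" + "/".join(p[:i]) for i in range(1, len(p))]

def lca(path_a, path_b):
    """Longest common prefix of the two paths, taken by whole components."""
    common = []
    for x, y in zip(parts(path_a), parts(path_b)):
        if x != y:
            break
        common.append(x)
    return "/" + "/".join(common)

def spanning_clade(all_paths, path_a, path_b):
    """Every node at or below the LCA of the two nodes."""
    root = lca(path_a, path_b)
    prefix = root.rstrip("/") + "/"
    return [p for p in all_paths if p == root or p.startswith(prefix)]

def path_length(path_a, path_b):
    """Number of edges between the two nodes, going via their LCA."""
    shared = len(parts(lca(path_a, path_b))) if lca(path_a, path_b) != "/" else 0
    return len(parts(path_a)) + len(parts(path_b)) - 2 * shared

# Node paths from the figure.
all_paths = ["/1", "/2", "/1/1", "/1/2", "/1/1/1", "/1/1/2",
             "/1/2/1", "/1/2/2", "/1/1/1/1", "/1/1/1/2"]

print(ancestors("/1/1/1/2"))
print(lca("/1/1/1/2", "/1/1/2"))
print(spanning_clade(all_paths, "/1/2/1", "/1/2/2"))
print(path_length("/1/1/1/2", "/1/2/2"))
```

Because every query is a string operation on the path column, none of them needs a recursive SQL query; the cost is that paths must be rewritten whenever a subtree is moved.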
What do we do now…?
• Set up a taxonomic name server (TNS)
• Develop a phylogenetic database
linked to TNS, PubMed, GenBank, etc.
• Develop easy ways to populate database
(e.g., from TreeBASE, GenBank, journal
databases)
• Develop standard set of tree queries
• Deploy