Working with Trees in the Phyloinformatic Age
William H. Piel
Yale Peabody Museum
Hilmar Lapp
NESCent, Duke University
Dealing with the Growth of Phyloinformatics
• Trees: Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees: Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
Searching Stored Trees
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Transitive Closure
Dewey system:
[Figure: tree ((A,B),((C,D),E)) with every node labelled by its Dewey path, from 0 at the root down to 0.2.1.2 at D]
Find clade for: Z = (<C + D), the smallest clade containing C and D
Label   Path
Root    0
NULL    0.1
A       0.1.1
B       0.1.2
NULL    0.2
NULL    0.2.1
C       0.2.1.1
D       0.2.1.2
E       0.2.2
Find the common pattern in the paths of C and D, starting from the left (here 0.2.1):
SELECT *
FROM nodes
WHERE path LIKE '0.2.1%';
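A minimal Python sketch of the path-enumeration idea, assuming the example tree is ((A,B),((C,D),E)) as read off the path table. The function names (dewey_paths, common_prefix, clade) are made up for illustration. The prefix test appends a dot so that 0.2.1 cannot accidentally match a sibling such as 0.2.10, which a bare LIKE '0.2.1%' would.

```python
from typing import Dict, List, Optional, Union

Tree = Union[str, tuple]  # a leaf label, or a tuple of child subtrees


def dewey_paths(tree: Tree, path: str = "0") -> Dict[str, Optional[str]]:
    """Map each Dewey path to its node label (None for unlabeled nodes)."""
    if isinstance(tree, str):
        return {path: tree}
    rows: Dict[str, Optional[str]] = {path: None}
    for i, child in enumerate(tree, start=1):
        rows.update(dewey_paths(child, f"{path}.{i}"))
    return rows


def common_prefix(path_a: str, path_b: str) -> str:
    """Longest shared run of path components, compared from the left."""
    shared = []
    for x, y in zip(path_a.split("."), path_b.split(".")):
        if x != y:
            break
        shared.append(x)
    return ".".join(shared)


def clade(rows: Dict[str, Optional[str]], prefix: str) -> List[str]:
    """Paths of all nodes at or below the node whose path is `prefix`."""
    return sorted(p for p in rows if p == prefix or p.startswith(prefix + "."))


# Assumed example tree, matching the Label/Path table.
TREE: Tree = (("A", "B"), (("C", "D"), "E"))
ROWS = dewey_paths(TREE)
PATH_OF = {label: path for path, label in ROWS.items() if label is not None}
PREFIX = common_prefix(PATH_OF["C"], PATH_OF["D"])
print(PREFIX, clade(ROWS, PREFIX))
```

The weakness of this encoding is visible in the sketch: deep trees produce long path strings, which is what the two systems below work around.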
• ATreeGrep
– Uses special suffix indexing to optimize speed
– Shasha, D., J. T. L. Wang, H. Shan and K. Zhang. 2002.
ATreeGrep: Approximate Searching in Unordered Trees.
Proceedings of the 14th SSDBM, Edinburgh, Scotland,
pp. 89-98.
• Crimson
– Uses nested subtrees to avoid long strings
– Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim, and S. B.
Davidson. 2006. Crimson: A Data Management System
to Support Evaluating Phylogenetic Tree
Reconstruction Algorithms. 32nd International
Conference on Very Large Data Bases, ACM, pp. 1231-1234.
Searching Stored Trees
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
Depth-first traversal scoring each node with a left and right ID
[Figure: tree ((A,B),((C,D),E)) with left and right IDs from 1 to 18 assigned by depth-first traversal]
Minimum Spanning Clade of Node 5
Label   Left   Right
-       1      18
-       2      7
A       3      4
B       5      6
-       8      17
-       9      14
C       10     11
D       12     13
E       15     16
SELECT *
FROM nodes
INNER JOIN nodes AS include
ON (nodes.left_id BETWEEN include.left_id
AND include.right_id)
WHERE include.node_id = 5 ;
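A small sketch of how the left and right IDs can be assigned and queried, assuming the example tree ((A,B),((C,D),E)) and assuming node IDs are handed out in the same depth-first order (so node 5 is the parent of C, D, and E). The table layout and the helper name nested_set_rows are illustrative, not the schema of any particular database. SQLite is used only because it ships with Python.

```python
import sqlite3
from typing import List, Optional, Tuple, Union

Tree = Union[str, tuple]
Row = Tuple[int, Optional[str], int, int]  # node_id, node_label, left_id, right_id


def nested_set_rows(tree: Tree) -> List[Row]:
    """Assign left/right IDs in one depth-first pass over the tree."""
    rows: List[Row] = []
    counter = 0  # last left/right ID handed out

    def visit(node: Tree) -> None:
        nonlocal counter
        counter += 1
        left = counter
        index = len(rows)
        rows.append((index + 1, None, left, 0))  # right_id filled in below
        if isinstance(node, str):
            label: Optional[str] = node
        else:
            label = None
            for child in node:
                visit(child)
        counter += 1
        rows[index] = (index + 1, label, left, counter)

    visit(tree)
    return rows


ROWS = nested_set_rows((("A", "B"), (("C", "D"), "E")))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT,"
    " left_id INTEGER, right_id INTEGER)"
)
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", ROWS)

# Same BETWEEN join as the query above; only the alias name differs.
CLADE = conn.execute(
    """
    SELECT nodes.node_id, nodes.node_label
    FROM nodes
    INNER JOIN nodes AS ancestor
       ON (nodes.left_id BETWEEN ancestor.left_id AND ancestor.right_id)
    WHERE ancestor.node_id = 5
    ORDER BY nodes.left_id
    """
).fetchall()
print(CLADE)
```

Retrieving a whole clade is a single range comparison, but inserting a node means renumbering every ID to its right.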
• PhyloFinder
– Duhong Chen et al.
– http://pilin.cs.iastate.edu/phylofinder/
• Mackey, A. 2002. Relational Modeling of Biological Data:
Trees and Graphs. Bioinformatics Technology Conference.
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
Searching Stored Trees
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
[Figure: tree ((A,B),((C,D),E)) with node IDs 1 to 9 and the parent ID of each node]
node_label:  -  -  A  B  -  -  C  D  E
node_id:     1  2  3  4  5  6  7  8  9
parent_id:   -  1  2  2  1  5  6  6  5
SQL Query to find parent node of node “D”:
SELECT *
FROM nodes AS parent
INNER JOIN nodes AS child
ON (child.parent_id = parent.node_id)
WHERE child.node_label = 'D';
…but this requires an external procedure to navigate the tree.
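The "external procedure" can be as small as a loop that follows parent_id pointers. The sketch below assumes the node_id and parent_id values from the table above are held in memory; the helper names ancestors and common_ancestor are invented here, and common_ancestor is only meant for two distinct tips.

```python
from typing import Dict, List, Optional

# node_id -> parent_id, copied from the adjacency list above (root has no parent).
PARENT: Dict[int, Optional[int]] = {
    1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 6, 8: 6, 9: 5,
}
LABEL: Dict[int, str] = {3: "A", 4: "B", 7: "C", 8: "D", 9: "E"}


def ancestors(node_id: int) -> List[int]:
    """Follow parent pointers up to the root, nearest ancestor first."""
    chain: List[int] = []
    current = PARENT[node_id]
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def common_ancestor(tip_a: int, tip_b: int) -> int:
    """Nearest node that is an ancestor of both tips."""
    seen = set(ancestors(tip_a))
    for node in ancestors(tip_b):  # nearest first
        if node in seen:
            return node
    raise ValueError("the two nodes are not in the same tree")


print(ancestors(8))           # ancestors of D
print(common_ancestor(7, 8))  # C and D
print(common_ancestor(3, 8))  # A and D
```

Each step up the tree is one lookup, which in a database means one query per level unless the engine supports recursion.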
Searching Stored Trees
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
Searching trees by distance metrics: USim distance
Wang, J. T. L., H. Shan, D. Shasha and W. H. Piel. 2005. Fast Structural Search in
Phylogenetic Databases. Evolutionary Bioinformatics Online, 1: 37-46
[Figure: two four-taxon trees over A, B, C, and D, each with its distance matrix]
Tree 1:
     A  B  C  D
A    0  1  2  3
B    1  0  2  3
C    1  1  0  2
D    1  1  1  0
Tree 2:
     A  B  C  D
A    0  1  2  2
B    1  0  2  2
C    2  2  0  1
D    2  2  1  0
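The two matrices are consistent with counting, for each row taxon, the number of edges going up to its common ancestor with the column taxon. The sketch below rests on that reading, and on the assumption that the two trees are (((A,B),C),D) and ((A,B),(C,D)); both topologies are inferred from the matrices, not stated on the slide. It only rebuilds the matrices and does not compute the USim score itself.

```python
from typing import Dict, List, Sequence, Tuple, Union

Tree = Union[str, tuple]


def leaf_paths(tree: Tree, prefix: Tuple[int, ...] = ()) -> Dict[str, Tuple[int, ...]]:
    """Map each leaf to the sequence of child indices leading to it from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths: Dict[str, Tuple[int, ...]] = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def up_matrix(tree: Tree, order: Sequence[str]) -> List[List[int]]:
    """Entry [u][v] = edges from leaf u up to the common ancestor of u and v."""
    paths = leaf_paths(tree)
    matrix = []
    for u in order:
        row = []
        for v in order:
            shared = 0  # depth of the common ancestor of u and v
            for x, y in zip(paths[u], paths[v]):
                if x != y:
                    break
                shared += 1
            row.append(len(paths[u]) - shared)
        matrix.append(row)
    return matrix


TAXA = ["A", "B", "C", "D"]
TREE_1: Tree = ((("A", "B"), "C"), "D")   # assumed topology for the first matrix
TREE_2: Tree = (("A", "B"), ("C", "D"))   # assumed topology for the second matrix
MATRIX_1 = up_matrix(TREE_1, TAXA)
MATRIX_2 = up_matrix(TREE_2, TAXA)
for row in MATRIX_1:
    print(row)
```

Because the matrices depend only on pairs of taxa, two trees can be compared on whatever taxa they share without aligning the trees themselves.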
Searching Stored Trees
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Transitive Closure
Transitive Closure
• Finding paths between vertices on a graph
• DB2 and Oracle have special functions:
– FROM edge
  START WITH (child_id = A AND tree_id = T)
  CONNECT BY (PRIOR parent_id = child_id)
  AND (PRIOR tree_id = tree_id)
• Nakhleh, L., D. Miranker, F. Barbancon, W. H.
Piel, and M. Donoghue. 2003. Requirements of
phylogenetic databases. Third IEEE Symposium
on Bioinformatics and Bioengineering, pp. 141-148.
• Paths can be precomputed and stored: BioSQL
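Where CONNECT BY is not available, a recursive common table expression can walk the same path. The sketch below uses SQLite through Python's sqlite3 module purely as a convenient stand-in; the edge table layout (tree_id, parent_id, child_id) and the node IDs are assumptions modelled on the adjacency-list example, not the schema of a real database.

```python
import sqlite3
from typing import List, Tuple

# (parent_id, child_id) pairs for the tree ((A,B),((C,D),E)) with node IDs 1 to 9.
EDGES = [(1, 2), (2, 3), (2, 4), (1, 5), (5, 6), (6, 7), (6, 8), (5, 9)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (tree_id INTEGER, parent_id INTEGER, child_id INTEGER)")
conn.executemany("INSERT INTO edge VALUES (1, ?, ?)", EDGES)

# Start at one child and repeatedly step to the parent, counting the steps.
ANCESTORS_SQL = """
WITH RECURSIVE walk(node_id, steps) AS (
    SELECT parent_id, 1
    FROM edge
    WHERE child_id = :start AND tree_id = :tree
    UNION ALL
    SELECT e.parent_id, walk.steps + 1
    FROM edge AS e
    JOIN walk ON e.child_id = walk.node_id
    WHERE e.tree_id = :tree
)
SELECT node_id, steps FROM walk ORDER BY steps
"""


def ancestors(start: int, tree: int = 1) -> List[Tuple[int, int]]:
    """All (ancestor_id, distance) pairs for one node, nearest first."""
    return conn.execute(ANCESTORS_SQL, {"start": start, "tree": tree}).fetchall()


print(ancestors(8))  # walk from D (node 8) to the root
```

Running this for every node and storing the result gives the precomputed path table described next.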
Dealing with the Growth of Phyloinformatics
• Trees: Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees: Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
BioSQL: http://www.biosql.org/
Schema for persistent storage of sequences and features tightly
integrated with BioPerl (+ BioPython, BioJava, and BioRuby)
• phylodb extension designed at NESCent Hackathon
• Perl command-line interface by Jamie Estill, GSoC
Index of all paths from ancestors to descendants
CREATE TABLE node_path (
child_node_id integer,
parent_node_id integer,
distance integer);
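A minimal sketch of how the node_path rows could be precomputed from parent pointers. It assumes a small tree ((A,B),C) with node IDs 1 (root), 2 (parent of A and B), 3 (A), 4 (B), and 5 (C); those IDs and the helper name node_path_rows are illustrative. Zero-length paths from a node to itself are left out here, which is a choice of this sketch, not a statement about the BioSQL extension.

```python
from typing import Dict, List, Optional, Tuple

PathRow = Tuple[int, int, int]  # child_node_id, parent_node_id, distance


def node_path_rows(parent: Dict[int, Optional[int]]) -> List[PathRow]:
    """One row for every (descendant, ancestor) pair, with the number of edges between them."""
    rows: List[PathRow] = []
    for child in sorted(parent):
        distance = 0
        ancestor = parent[child]
        while ancestor is not None:
            distance += 1
            rows.append((child, ancestor, distance))
            ancestor = parent[ancestor]
    return rows


# Assumed example: node_id -> parent_id for the tree ((A,B),C).
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
ROWS = node_path_rows(PARENT)
for row in ROWS:
    print(row)
```

The table grows with the depth of every node, so storage is traded for queries that need no recursion.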
[Figure: example tree with leaves A, B, and C, shown next to its node_path rows]
Find all paths where A and B share a common parent_node_id
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B';
…of those paths, select the one with the shortest path (the most recent common ancestor of A and B)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance
LIMIT 1;
…of those paths, select the one with the longest path (the root)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance DESC
LIMIT 1;
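The two queries above can be tried on a toy database. The sketch below loads an assumed tree ((A,B),C) into SQLite, with node IDs 1 (root), 2 (parent of A and B), 3 (A), 4 (B), and 5 (C); the IDs, the reduced nodes table, and the wrapper function shared_ancestor are all illustrative. Only the sort direction differs between the nearest and the most distant shared ancestor.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT);
    CREATE TABLE node_path (
        child_node_id integer,
        parent_node_id integer,
        distance integer);
    INSERT INTO nodes VALUES (1, NULL), (2, NULL), (3, 'A'), (4, 'B'), (5, 'C');
    INSERT INTO node_path VALUES
        (2, 1, 1), (3, 2, 1), (3, 1, 2), (4, 2, 1), (4, 1, 2), (5, 1, 1);
    """
)

# The query from the slides, with the labels as parameters and the sort
# direction left open.
SHARED_SQL = """
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = ?
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = ?
ORDER BY pA.distance {direction}
LIMIT 1
"""


def shared_ancestor(label_a: str, label_b: str, nearest: bool = True) -> Optional[int]:
    """Nearest (or, with nearest=False, most distant) ancestor shared by two labels."""
    sql = SHARED_SQL.format(direction="ASC" if nearest else "DESC")
    row = conn.execute(sql, (label_a, label_b)).fetchone()
    return row[0] if row else None


print(shared_ancestor("A", "B"))                 # shortest path
print(shared_ancestor("A", "B", nearest=False))  # longest path
```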
Find the maximum spanning clade (i.e. the subtree) for each tree that
includes A and B but not C:
Return an adjacency list for each subtree
SELECT e.parent_id AS parent, e.child_id AS child, ch.node_label, pt.tree_id
FROM node_path p, edges e, nodes pt, nodes ch
WHERE e.child_id = p.child_node_id
AND pt.node_id = e.parent_id
AND ch.node_id = e.child_id
AND p.parent_node_id IN (
    -- Get all ancestors shared by A and B
    SELECT pA.parent_node_id
    FROM node_path pA, node_path pB, nodes nA, nodes nB
    WHERE pA.parent_node_id = pB.parent_node_id
    AND pA.child_node_id = nA.node_id
    AND nA.node_label = 'A'
    AND pB.child_node_id = nB.node_id
    AND nB.node_label = 'B')
AND NOT EXISTS (
    -- Exclude those that are also ancestors to C
    SELECT 1 FROM node_path np, nodes n
    WHERE np.child_node_id = n.node_id
    AND n.node_label = 'C'
    AND np.parent_node_id = p.parent_node_id);
Find trees that contain a clade that includes A and B but not C:
List the set of trees with these ancestors
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
AND ch.tree_id = t.tree_id
AND p.parent_node_id IN (
    -- Get all ancestors shared by A and B
    SELECT pA.parent_node_id
    FROM node_path pA, node_path pB, nodes nA, nodes nB
    WHERE pA.parent_node_id = pB.parent_node_id
    AND pA.child_node_id = nA.node_id
    AND nA.node_label = 'A'
    AND pB.child_node_id = nB.node_id
    AND nB.node_label = 'B')
AND NOT EXISTS (
    -- Exclude those that are also ancestors to C
    SELECT 1 FROM node_path np, nodes n
    WHERE np.child_node_id = n.node_id
    AND n.node_label = 'C'
    AND np.parent_node_id = p.parent_node_id);
Find trees that contain a clade that includes (A, B, C) but not D or E:
SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
FROM ( SELECT DISTINCT ON (n.node_id) n.node_id, t.tree_id, t.name
       FROM trees t, nodes n,
            -- Get all ancestors of A, B, C from all trees that have A, B, C
            (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id
             FROM nodes inN, node_path inP
             WHERE inN.node_label IN ('A','B','C')
             AND inP.child_node_id = inN.node_id
             GROUP BY inN.tree_id, inP.parent_node_id
             -- Number of ingroups that share that node
             HAVING COUNT(inP.child_node_id) = 3
             ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
       WHERE n.node_id IN (lca.parent_node_id)
       AND t.tree_id = n.tree_id
       -- Exclude those that are also ancestors to D, E
       AND NOT EXISTS (SELECT 1
                       FROM nodes outN, node_path outP
                       WHERE outN.node_label IN ('D','E')
                       AND outP.child_node_id = outN.node_id
                       AND outP.parent_node_id = lca.parent_node_id)
       -- But make sure that the tree still contains D, E
       AND EXISTS (SELECT c.tree_id
                   FROM trees c, nodes q
                   WHERE q.node_label IN ('D','E')
                   AND q.tree_id = c.tree_id
                   AND c.tree_id = t.tree_id
                   GROUP BY c.tree_id
                   -- Number of non-ingroups that must be in tree
                   HAVING COUNT(c.tree_id) = 2)) AS qry
GROUP BY (qry.tree_id)
-- Number of clades that each tree must satisfy
HAVING COUNT(qry.node_id) = 1;
Here's a faster, cleaner version:
SELECT t.tree_id, t.name
FROM trees t
INNER JOIN
(SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
FROM nodes inN, node_path inP
WHERE inN.node_label IN ('A','B','C')
AND inP.child_node_id = inN.node_id
GROUP BY inN.tree_id, inP.parent_node_id
HAVING COUNT(inP.child_node_id) = 3
ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
USING (tree_id)
WHERE NOT EXISTS (
SELECT 1
FROM nodes outN, node_path outP
WHERE outN.node_label IN ('D','E')
AND outP.child_node_id = outN.node_id
AND outP.parent_node_id = lca.parent_node_id)
AND EXISTS (
SELECT c.tree_id
FROM trees c, nodes q
WHERE q.node_label IN ('D','E')
AND q.tree_id = c.tree_id
AND c.tree_id = t.tree_id
GROUP BY c.tree_id
HAVING COUNT(c.tree_id) = 2);
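The logic of the query can be restated procedurally: find the nearest ancestor shared by the ingroup, then check that no outgroup taxon descends from it, while still requiring every named taxon to be in the tree. The Python sketch below assumes the nine-node example tree ((A,B),((C,D),E)) held as parent pointers; has_clade and ancestors are illustrative names, and the ingroup is assumed to be non-empty.

```python
from typing import Dict, List, Optional, Sequence

# node_id -> parent_id and node_id -> label for the tree ((A,B),((C,D),E)).
PARENT: Dict[int, Optional[int]] = {
    1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 6, 8: 6, 9: 5,
}
LABEL: Dict[int, str] = {3: "A", 4: "B", 7: "C", 8: "D", 9: "E"}


def ancestors(node_id: int) -> List[int]:
    """Ancestors of a node, nearest first."""
    chain: List[int] = []
    current = PARENT[node_id]
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def has_clade(ingroup: Sequence[str], outgroup: Sequence[str]) -> bool:
    """True if some clade holds every ingroup taxon and no outgroup taxon."""
    node_of = {label: node_id for node_id, label in LABEL.items()}
    # The tree must contain every taxon named in the query.
    if any(taxon not in node_of for taxon in list(ingroup) + list(outgroup)):
        return False
    chains = [ancestors(node_of[taxon]) for taxon in ingroup]
    shared = [n for n in chains[0] if all(n in chain for chain in chains[1:])]
    if not shared:
        return False
    lca = shared[0]  # nearest ancestor shared by the whole ingroup
    return all(lca not in ancestors(node_of[taxon]) for taxon in outgroup)


print(has_clade(["C", "D"], ["E"]))
print(has_clade(["A", "B", "C"], ["D", "E"]))
```

Checking only the nearest shared ancestor is enough, because any larger clade containing the ingroup also contains everything below that ancestor.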
Matching a whole tree means querying for all clades
[Figure: tree ((A,B),((C,D),E)) with node IDs 1 to 9]
(A, B) but not C, D, E
(C, D) but not A, B, E
(C, D, E) but not A, B
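The list of clade queries can be generated mechanically from the tree. A short sketch, assuming the tree is ((A,B),((C,D),E)) and written as nested tuples; clade_queries is a made-up helper name. The root is skipped because it excludes nothing, and single tips are skipped because they carry no grouping information.

```python
from typing import List, Tuple, Union

Tree = Union[str, tuple]
Query = Tuple[List[str], List[str]]  # (taxa inside the clade, taxa outside it)


def leaves(tree: Tree) -> List[str]:
    """Leaf labels of a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def clade_queries(tree: Tree) -> List[Query]:
    """One (include, exclude) query per internal node other than the root."""
    all_leaves = leaves(tree)
    queries: List[Query] = []

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        inside = leaves(node)
        outside = [leaf for leaf in all_leaves if leaf not in inside]
        if outside:  # the root excludes nothing, so it yields no query
            queries.append((inside, outside))
        for child in node:
            visit(child)

    visit(tree)
    return queries


QUERIES = clade_queries((("A", "B"), (("C", "D"), "E")))
for inside, outside in QUERIES:
    print("(" + ", ".join(inside) + ") but not " + ", ".join(outside))
```

A stored tree matches the whole query tree only if every one of these queries succeeds against it.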
Dealing with the Growth of Phyloinformatics
• Trees: Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees: Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
Mining trees for interesting, general,
relationship questions:
[Figure: two alternative trees for Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus, and Felis catus]
(((Sus_scrofa, Hippopotamus),Balaenoptera),Equus_caballus)
vs
((Sus_scrofa, (Hippopotamus,Balaenoptera)),Equus_caballus)
Even with perfectly resolved OTUs, you will still fail to
hit relevant trees:
[Figure: two trees that sample different species: Sus scrofa vs. Sus celebensis and Equus caballus vs. Equus asinus; Hippopotamus, Balaenoptera, and Felis catus appear in both]
Step 1: for each clade in all trees in the database, run a
stem query on a classification tree (e.g. NCBI)
Step 2: label each node with an NCBI taxon id (if
there is a match)
Step 3: do the same for the query tree
[Figure: tree ((A,B),((C,D),E)) with node IDs 1 to 9]
Stem Queries:
Node 2: (>A, B - C, D, E)
Node 3: (>A - B, C, D, E)
Node 4: (>B - A, C, D, E)
Node 5: (>C, D, E - A, B)
Node 6: (>C, D - A, B, E)
Node 7: (>C - A, B, D, E)
Node 8: (>D - A, B, C, E)
Node 9: (>E - A, B, C, D)
Rename nodes according to their deepest stem query…
[Figure: primate trees with tips Gorilla gorilla, Homo sapiens, Pan troglodytes, Pongo pygmaeus, Macaca sinica, Macaca nigra, and Macaca irus; nodes are relabelled with matching taxa such as Gorilla, Homo, Pan, Hominoidea, and Cercopithecoidea]
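A sketch of how a stem query might be answered against a classification. The classification below is a hand-written, heavily simplified stand-in (it is not NCBI data and skips most ranks), and the function names lineage and stem_query are invented for this example. A stem query (>ingroup - outgroup) is read here as: the most inclusive taxon that contains every ingroup member and no outgroup member.

```python
from typing import Dict, List, Optional, Sequence

# Simplified toy classification: taxon -> parent taxon (assumption, not NCBI data).
PARENT_TAXON: Dict[str, str] = {
    "Hominoidea": "Catarrhini",
    "Cercopithecoidea": "Catarrhini",
    "Gorilla": "Hominoidea",
    "Homo": "Hominoidea",
    "Pan": "Hominoidea",
    "Pongo": "Hominoidea",
    "Macaca": "Cercopithecoidea",
    "Gorilla gorilla": "Gorilla",
    "Homo sapiens": "Homo",
    "Pan troglodytes": "Pan",
    "Pongo pygmaeus": "Pongo",
    "Macaca sinica": "Macaca",
    "Macaca nigra": "Macaca",
}


def lineage(taxon: str) -> List[str]:
    """The taxon itself followed by its ancestors up to the root."""
    chain = [taxon]
    while chain[-1] in PARENT_TAXON:
        chain.append(PARENT_TAXON[chain[-1]])
    return chain


def stem_query(ingroup: Sequence[str], outgroup: Sequence[str]) -> Optional[str]:
    """Most inclusive taxon holding all of the ingroup and none of the outgroup."""
    others = [lineage(taxon) for taxon in ingroup[1:]]
    shared = [t for t in lineage(ingroup[0]) if all(t in chain for chain in others)]
    excluded = {t for taxon in outgroup for t in lineage(taxon)}
    allowed = [t for t in shared if t not in excluded]
    # `shared` runs from the tips toward the root, so the last entry is the largest.
    return allowed[-1] if allowed else None


APES = ["Gorilla gorilla", "Homo sapiens", "Pan troglodytes"]
MACAQUES = ["Macaca sinica", "Macaca nigra"]
print(stem_query(APES, MACAQUES))
print(stem_query(MACAQUES, APES))
print(stem_query(["Gorilla gorilla"], ["Homo sapiens", "Pan troglodytes"]))
```

A node whose query returns nothing stays unlabelled, which is the "if there is a match" condition in Step 2.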
Dealing with the Growth of Phyloinformatics
• Trees: Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees: Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
PhyloWidget
• Greg Jordan
– Google Summer of Code student
– Nick Goldman's group, EBI
• Java Applet
– Uses the Processing graphics library
• Originally designed as a graphical phylogenetic query and
display tool for TreeBASE, BioSQL, etc.
• Can be used for:
– Manipulating and visualizing large trees
– Building supertrees through pruning & grafting
Thanks