Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms

Yifeng Zheng, Stephen Fisher, Shirley Cohen, Sheng Guo, Junhyong Kim and Susan B. Davidson

Background
• Phylogenetics: the science of identifying and understanding evolutionary relationships between different species.
• Cyberinfrastructure for Phylogenetic Research project (CIPRES)
  – Design efficient data storage and query capabilities for managing phylogenetic trees.
  – Evaluate existing phylogenetic tree reconstruction algorithms.
    • Build “gold standards” by simulating very large phylogenetic trees, as well as sequences for each species in the tree, according to models that are carefully curated by experts.
  – ...
• The Crimson system focuses on providing data management support for CIPRES simulation.

Technical Challenges
• Phylogenetic trees may contain millions of species, each associated with sequences of thousands of characters. Managing and querying this data efficiently is important.
• Data management strategies developed for XML are not suitable for phylogenetic tree management.
  – Unlike the XML documents used in web and commercial applications, which are relatively shallow, phylogenetic trees can be very deep.
    • According to a survey of 200,000 XML documents by Mignet, Barbosa and Veltri (WWW 2003), the average depth of XML documents was 4 and the deepest was 135.
    • Simulated phylogenetic trees have an average depth greater than 1000, and the deepest can exceed 1 million.
  – Queries used with phylogenetic trees are also very different from the path-oriented or restructuring queries supported by XPath and XQuery.

Our Solution
• Data storage and index strategy: an extension of the Dewey labeling scheme.
• A query evaluation algorithm that achieves high performance.
• A user-friendly data management system: the Crimson system.

[Figure: Phylogenetic Trees]
[Figure: Extended Dewey Labeling]

System Architecture
[Architecture diagram. Labels: Input, Query, Sampling, Sampling Strategy, Species with Sequences, Query History, Projection Tree, Simulation Tree, Data Loader, Tree Projector, Tree Viewer, GUI Manager, Benchmark Manager, Repository Manager, Species Repository, Tree Repository, Query Repository]

Performance Results
[Figure: Time to generate the tree and store it given a 20 leaf node set. x-axis: Leaf nodes in the file (*100); y-axis: Time (seconds)]
[Figure: Time to generate and store a subtree from the selected leaves of a phylogenetic tree with 2000 leaves. x-axis: Number of randomly selected leaves (*10); y-axis: Time (seconds)]

References:
• Cyberinfrastructure for Phylogenetic Research (CIPRES) project (www.phylo.org)
• Susan B. Davidson, Junhyong Kim, Yifeng Zheng: Efficiently Supporting Structure Queries on Phylogenetic Trees. SSDBM 2005: 93-102

Phylogenetic Queries
• The phylogenetic reconstruction problem is NP-hard, so current algorithms can only handle a relatively small input set. To benchmark these reconstruction algorithms, we must therefore be able to efficiently sample a subset of species according to various criteria, and project the tree pattern induced by the sample in the simulation tree.
  – Sampling a set of species according to a given time
  – Tree projection

Tree projection
• Determines the relationship among a set of species by appealing to an authoritative tree.
• Given a tree T and a subset S of its leaves, the tree projection of T over S is a “subtree” T’ in which each edge is a subpath of a path from the root of T to a node in S and each internal node has at least two children.

Sampling
• Guarantees that the sampling results are derived from an evolutionary time period.
• Given a tree T whose edge weights represent time, sampling a set of species according to a given time t returns a subset of T’s leaf set such that all species whose evaluation time (the weighted distance from the root to that species) is t have the same number of descendant species sampled.
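The tree projection defined above can be illustrated with a small sketch. This is not Crimson's label-based algorithm; it is a naive traversal that assumes the tree fits in memory as a hypothetical `children` dictionary (node name to list of child names). It keeps only nodes on root-to-sample paths and suppresses nodes left with a single child, so the projected root is the lowest common ancestor of the sample. The traversal is iterative because simulated trees can be far deeper than a recursion limit allows.

```python
def project(children, root, sample):
    """Return (projected_root, projected_children) for the tree over `sample`.

    `children` maps each internal node to its list of children (assumed layout).
    `sample` is a collection of leaf names; anything else in it is ignored.
    """
    sample = set(sample)

    # Visit parents before descendants, without recursion.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    rep = {}   # node -> node that represents its subtree in the projection
    proj = {}  # projected internal node -> list of projected children
    for node in reversed(order):  # descendants are processed before parents
        kids = children.get(node, [])
        if not kids:
            rep[node] = node if node in sample else None
            continue
        kept = [rep[k] for k in kids if rep[k] is not None]
        if len(kept) >= 2:
            proj[node] = kept      # branching node survives
            rep[node] = node
        elif len(kept) == 1:
            rep[node] = kept[0]    # unary node is suppressed
        else:
            rep[node] = None       # no sampled leaf below this node
    return rep[root], proj


# Hypothetical example tree with five leaves s1..s5.
tree = {"r": ["a", "b"], "a": ["s1", "s2"], "b": ["c", "s5"], "c": ["s3", "s4"]}
proj_root, proj_children = project(tree, "r", {"s1", "s3", "s4"})
```

In this example the nodes `a` and `b` each keep only one sampled branch, so they are dropped and their edges merge into the edges `r`-`s1` and `r`-`c`.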
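Sampling according to a given time can be sketched in the same style. The reading used here is an assumption, not a statement of Crimson's implementation: the lineages "at time t" are taken to be the first nodes on each root-to-leaf path whose weighted distance from the root is at least t, and the same number `k` of leaves is drawn uniformly at random below each of them. The `children` and `weight` dictionaries and the function names are hypothetical.

```python
import random


def leaves_below(children, node):
    """Collect all leaves in the subtree rooted at `node` (iteratively)."""
    leaves, stack = [], [node]
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(current)
    return leaves


def sample_by_time(children, weight, root, t, k, rng=None):
    """Draw k leaves below every lineage present at time t (assumed reading)."""
    rng = rng or random.Random()

    # Frontier: first node on each path whose root distance reaches t.
    frontier, stack = [], [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        if dist >= t:
            frontier.append(node)
            continue
        for child in children.get(node, []):
            stack.append((child, dist + weight[(node, child)]))

    sampled = []
    for node in frontier:
        candidates = leaves_below(children, node)
        if len(candidates) < k:
            raise ValueError(f"node {node!r} has fewer than {k} descendant leaves")
        sampled.extend(rng.sample(candidates, k))
    return sampled


# Hypothetical example: two lineages exist at time 1.0, each with two leaves.
tree = {"r": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "s4"]}
weight = {("r", "a"): 1.0, ("r", "b"): 1.0,
          ("a", "s1"): 2.0, ("a", "s2"): 2.0,
          ("b", "s3"): 2.0, ("b", "s4"): 2.0}
picked = sample_by_time(tree, weight, "r", t=1.0, k=1, rng=random.Random(0))
```

Paths that end before time t contribute nothing under this reading, and a lineage with fewer than `k` descendant leaves raises an error rather than being sampled unevenly.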
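For context on the labeling scheme named under Our Solution, the following sketch shows plain Dewey labeling only; the extension used by Crimson is not described here and is not reproduced. In plain Dewey labeling a node's label is its parent's label plus the node's position among its siblings, so the lowest common ancestor of two nodes is the longest common prefix of their labels. Starting the root at `(1,)` and the `children` dictionary layout are assumptions made for illustration.

```python
def dewey_labels(children, root):
    """Assign plain Dewey labels (tuples of sibling ordinals) to every node."""
    labels = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for ordinal, child in enumerate(children.get(node, []), start=1):
            labels[child] = labels[node] + (ordinal,)
            stack.append(child)
    return labels


def lca_label(label_a, label_b):
    """Label of the lowest common ancestor: the longest common prefix."""
    prefix = []
    for x, y in zip(label_a, label_b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


# Hypothetical example tree.
tree = {"r": ["a", "b"], "a": ["s1", "s2"], "b": ["c", "s5"], "c": ["s3", "s4"]}
labels = dewey_labels(tree, "r")
ancestor = lca_label(labels["s3"], labels["s5"])  # equals labels["b"]
```

Because a plain Dewey label has one component per level, its length equals the node's depth. That cost is small for shallow XML documents but becomes significant for trees whose depth is in the thousands or more, which is the motivation for extending the scheme.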