The Tree of Life: Challenges for Discrete Math and Theoretical Computer Science

The Tree of Life: Challenges for Discrete Mathematics and Theoretical Computer Science
Fred S. Roberts
DIMACS, Rutgers University

The tree of life problem raises new challenges for mathematics and computer science just as it does for biological science.

For mathematics and CS to be used more effectively, we need to:
• develop new tools;
• establish working partnerships between mathematical scientists and biological scientists;
• introduce the two communities to each other’s problems, language, and tools;
• introduce outstanding junior researchers from both sides to the issues, problems, and challenges arising from the tree of life;
• involve biological and mathematical scientists together to define the agenda and develop the tools of this field.

These are some of the motivations for this meeting.

I will lay out some of the challenges for math and CS, with emphasis on discrete math (DM) and theoretical CS (TCS).

What are DM and TCS?
• DM deals with:
– arrangements
– designs
– codes
– patterns
– schedules
– assignments
• TCS deals with the theory of computer algorithms.
• During the first 30-40 years of the computer age, TCS, aided by powerful mathematical methods, had a direct impact on technology by developing models, data structures, algorithms, and lower bounds that are now at the core of computing.

DM and TCS have found extensive use in many areas of science and public policy, for example in molecular biology. These tools seem especially relevant to problems of the tree of life.

DM and TCS Continued
• These tools are made especially relevant to the tree of life problem because of:
– Geographic Information Systems
– Availability of large and disparate computerized databases on subjects relating to species and the relevance of modern methods of data mining.

Outline
• Phylogenetic Tree Reconstruction
• Database Issues
• Nomenclature
• Setting up a Species Bank
• Digitization of Natural History Collections
• Interoperability
• The Many Applications of Research on the Tree of Life

Phylogenetic Tree Reconstruction

Phylogeny (continued)
• New methods of phylogenetic tree reconstruction owe a significant amount to modern methods of DM/TCS.
• Trees, supertrees, and consensus trees will all be discussed at length in this meeting.
• I will only make a few brief remarks about them.

Phylogenetic Challenges for DM/TCS
• Tailoring phylogenetic methods to describe the idiosyncrasies of viral evolution -- going beyond a binary tree with a small number of contemporaneous species appearing as leaves.
• Dealing with trees of thousands of vertices, many of high degree.
• Making use of data about species at internal vertices (e.g., when data comes from serial sampling of patients).

Phylogenetic Challenges for DM/TCS: Continued
• Network representations of evolutionary history, needed if recombination has taken place.
• Modeling viral evolution by a collection of trees -- to recognize the “quasispecies” nature of viruses.
• Devising fast methods to average the quantities of interest over all likely trees.
Thanks to Eddie Holmes and Mike Steel for ideas.
DIMACS Working Group on Phylogenetic Trees and Rapidly Evolving Diseases, Sept. 3-6, 2003

Database Issues
• Assembling the tree of life requires collecting massive amounts of data about the world’s species.
• Making it a collaborative project requires making such data universally available.
• There are great challenges for math and CS, specifically DM and TCS.
Thanks to the Global Biodiversity Information Facility (GBIF) for many of the following ideas.

Complexity of Data
• In many ways, data about the world’s species are far more complex than genetic or protein sequence data. (GBIF)

Complexity of Data (cont’d)
• There are databases of images, databases in numerous forms, etc.
• Data is heterogeneous.
• Data has errors and inconsistencies.

Nomenclature
• There are some 1.75M named species.
• By some estimates, there are up to 10M actual species.

Nomenclature (cont’d)
• The same species is often named more than once.
• On average, each species has two additional names (synonyms) besides its own name. (GBIF)

Nomenclature (cont’d)
• Thus, there is a need to assemble names in an electronic catalogue, with synonyms and common misspellings.
• This would be of fundamental importance in aiding research on biodiversity.

Nomenclature (cont’d)
• Because of errors, one major challenge for TCS is data cleaning.

Nomenclature (cont’d)
• Another challenge is to search a database to see if two entries are similar.
• This is a standard problem in database theory.
• TCS algorithms involving k-nearest neighbor and other methods are very helpful here.

Setting up a Species Bank

Setting up a Species Bank (cont’d)
• A species bank would provide not only names, but also data about a species:
– Type
– Distribution
– Ecological role
– Phylogenetic history
– Physiology
– Genomics
• This involves issues about huge datasets.

Setting up a Species Bank (cont’d)
• NASA earth science satellites alone beam home image data at the rate of 1.2 terabytes a day.
• By 2010, this is expected to grow to 10 petabytes a day. (Kathleen Bergen, U. Michigan)

Name        Equal to           Size in bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176

Setting up a Species Bank (cont’d)
• The problem is even worse: We need to combine information from many databases.
• There is no known way to catalogue all species of plants in one place given current database systems techniques. (Jessie Kennedy, Napier University, Edinburgh)

Setting up a Species Bank (cont’d)
• One possible approach: Tree and graph methods to support overlapping classifications as directed acyclic graphs or with complex objects (taxa or specimens) as nodes. (Jessie Kennedy)

Digitizing Natural History Collections
• It has been estimated that there are between 1.5 and 3 billion specimens in the world’s natural history collections, including herbaria, living microorganism stock centers, and other repositories. (GBIF)

Digitizing Natural History Collections (cont’d)
• If we could digitize information about these specimens, and make it available, we would “have a treasure trove of information about the world’s biota.” (GBIF)
• Pilot projects have shown that utilizing digitized data from several institutions’ databases can be a powerful tool. (GBIF)

Digitizing Natural History Collections (cont’d)
• Challenge: Digitization and referencing of nonstandard data (photos, sonograms, field notes).

Digitizing Natural History Collections (cont’d)
• Challenge: Develop methods for visualizing the data (e.g., species’ distributions).

Digitizing Natural History Collections (cont’d)
• Challenge: Develop search engines for real-time searching of such extremely large data sets.

Digitizing Natural History Collections (cont’d)
• Challenge: Make information access on the web more knowledge-based so humans and intelligent software can work together. (Susan Gauch, U. Kansas)

Digitizing Natural History Collections (cont’d)
• Challenge: Use “intelligent agents” to organize and present relevant information on the web. (Susan Gauch)

Digitizing Natural History Collections (cont’d)
• Challenge: Use partial information as “training data” for classification algorithms. (Susan Gauch)
• One approach: Use training data and classification algorithms with learning capabilities.
(See: DIMACS project on Monitoring Message Streams)

Digitizing Natural History Collections (cont’d)
• Another approach to problems posed by digitization: Use tools of “knowledge inferencing.” (Yannis Ioannidis, University of Wisconsin)
• Still another approach: Use methods of spatio-temporal data mining. (Ioannidis; see work of Muthukrishnan at Rutgers)

Interoperability
• Goal: Devise standards for datasets so as to allow researchers to collaborate across datasets – develop standards leading to database interoperability. (GBIF)

Interoperability
• Challenge: How do we develop ways to more accurately represent observational or experimental data so that others may use them? (Jessie Kennedy)
• Challenge: Deal with issues of inconsistency and scalability.
• Challenge: Formalize issues of policy with regard to others’ databases.
• Challenge: Interoperability over a diversity of users and types of equipment.

Interoperability
• One approach: “Semantic Web” – the term used to express the growing desire to make information access on the Web more knowledge-based so humans and intelligent software can work together. (Susan Gauch)

Interoperability
• Another approach: Make use of languages such as XML developed to aid interoperability in business and military collaborations.

The Many Applications of Research on the Tree of Life
• Side benefits in many fields:
– Agriculture
– Biomedicine
– Biotechnology
– Natural resource management
– Pest control
– Control of emergent diseases
– Sustainable use of biodiversity resources
– Global climate change

The Many Applications of Research on the Tree of Life
• Let’s say you’re importing bananas from South America.

The Many Applications of Research on the Tree of Life
• A camera in the hold of the ship sees a spider.
• What kind of spider is it?
• Is it safe to unload your cargo of bananas?

The Many Applications of Research on the Tree of Life
• Luckily, you have a digitized natural history database.
• With an efficient search feature.
(Thanks to Diana Lipscomb for this example.)

The Many Applications of Research on the Tree of Life
