The Tree of Life: Challenges for Discrete Math and Theoretical Computer Science

The Tree of Life:
Challenges for Discrete
Mathematics and Theoretical
Computer Science
Fred S. Roberts
DIMACS
Rutgers University
The tree of life problem raises new
challenges for mathematics and
computer science just as it does for
biological science.
For math and CS to be used more
effectively, we need to:
•develop new tools;
•establish working partnerships between
mathematical scientists and biological scientists;
•introduce the two communities to each other’s
problems, language, and tools;
•introduce outstanding junior researchers from
both sides to the issues and challenges
arising from the tree of life;
•involve biological and mathematical scientists
together to define the agenda and develop the
tools of this field.
These are some of the motivations for this
meeting.
I will lay out some of the challenges for
math and CS, with emphasis on discrete
math and theoretical CS.
What are DM and TCS?
•DM deals with:
–arrangements
–designs
–codes
–patterns
–schedules
–assignments
•TCS deals with the theory of computer
algorithms.
•During the first 30-40 years of the computer
age, TCS, aided by powerful mathematical
methods, had a direct impact on technology,
by developing models, data structures,
algorithms, and lower bounds that are now at
the core of computing.
DM and TCS have found extensive use in many
areas of science and public policy, for example in
Molecular Biology.
These tools seem especially relevant to problems of
the tree of life.
DM and TCS Continued
•These tools are made especially relevant to
the tree of life problem because of:
–Geographic Information Systems
–Availability of large and disparate
computerized databases on subjects
relating to species and the relevance of
modern methods of data mining.
Outline
•Phylogenetic Tree Reconstruction
•Database Issues
•Nomenclature
•Setting up a Species Bank
•Digitization of Natural History Collections
•Interoperability
•The Many Applications of Research on the
Tree of Life
Phylogenetic Tree
Reconstruction
Phylogeny (continued)
•New methods of phylogenetic tree
reconstruction owe a significant amount to
modern methods of DM/TCS.
•Trees, supertrees, and consensus trees will all
be discussed at length in this meeting.
•I will only make a few brief remarks about
them.
Phylogenetic Challenges for
DM/TCS
•Tailoring phylogenetic methods to describe
the idiosyncrasies of viral evolution -- going
beyond a binary tree with a small number of
contemporaneous species appearing as
leaves.
•Dealing with trees of thousands of vertices,
many of high degree.
•Making use of data about species at internal
vertices (e.g., when data comes from serial
sampling of patients).
Phylogenetic Challenges for
DM/TCS: Continued
•Network representations of evolutionary
history - if recombination has taken place.
•Modeling viral evolution by a collection of
trees -- to recognize the “quasispecies” nature
of viruses.
•Devising fast methods to average the
quantities of interest over all likely trees.
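One way to read the averaging challenge: if each candidate tree carries a weight (for example, an estimated probability) and yields its own estimate of some quantity, the average is the weight-normalized sum. The Python sketch below shows only that arithmetic. The weights and values are made-up illustrations, and the hard part named in the bullet (doing this quickly when the set of likely trees is enormous) is not addressed.

```python
def weighted_average(estimates):
    """Average a quantity over candidate trees.

    estimates: list of (weight, value) pairs, one pair per tree.
    """
    total_weight = sum(weight for weight, _ in estimates)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * value for weight, value in estimates) / total_weight


# Hypothetical input: three candidate trees with illustrative weights/values.
trees = [(0.5, 10.0), (0.3, 12.0), (0.2, 20.0)]
average = weighted_average(trees)
```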
Thanks to Eddie Holmes and Mike Steel for ideas.
DIMACS Working Group on Phylogenetic Trees and Rapidly
Evolving Diseases, Sept. 3-6, 2003
Database Issues
• Assembling the tree of life requires
collecting massive amounts of data
about the world’s scientific species.
• Making it a collaborative project
requires making such data universally
available.
• There are great challenges for Math and
CS, specifically DM and TCS.
Thanks to the Global Biodiversity Information Facility
(GBIF) for many of the following ideas.
Complexity of Data
• In many ways, data about the world’s species
are far more complex than genetic or protein
sequence data. (GBIF)
Complexity of Data (cont’d)
• There are databases of images,
databases in numerous forms, etc.
• Data is heterogeneous.
• Data has errors and inconsistencies.
Nomenclature
•There are some 1.75M named species
•By some estimates, there are up to 10M
actual species.
Nomenclature (cont’d)
• The same species is often named more than once.
• On average, each species has two additional
names (synonyms) besides its own name. (GBIF)
Nomenclature (cont’d)
• Thus, there is a need to assemble names in an
electronic catalogue, with synonyms and
common misspellings.
• This would be of fundamental importance in
aiding research on biodiversity.
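A minimal sketch of such a catalogue in Python, assuming a simple in-memory layout: every synonym and known misspelling is indexed to a single accepted name. The structure and function names are illustrative assumptions, and the misspelling entry is invented for the example.

```python
# Hypothetical catalogue layout: accepted name -> known variants.
CATALOGUE = {
    "Puma concolor": {
        "synonyms": ["Felis concolor"],
        "misspellings": ["Puma concolour"],  # invented for illustration
    },
}


def build_index(catalogue):
    """Map every variant (case-insensitively) to its accepted name."""
    index = {}
    for accepted, variants in catalogue.items():
        index[accepted.lower()] = accepted
        for name in variants["synonyms"] + variants["misspellings"]:
            index[name.lower()] = accepted
    return index


def resolve(name, index):
    """Return the accepted name for a variant, or None if unknown."""
    return index.get(name.strip().lower())


INDEX = build_index(CATALOGUE)
accepted = resolve("felis concolor", INDEX)
```

An exact-match index like this only covers variants already recorded; unseen misspellings need a similarity search.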
Nomenclature (cont’d)
• Because of errors, one major challenge for TCS
is data cleaning.
Nomenclature (cont’d)
• Another challenge is to search a database
to see if two entries are similar.
• This is a standard problem in database
theory.
• TCS algorithms involving k-nearest
neighbor and other methods are very
helpful here.
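As a rough illustration of a nearest-neighbor style query over names, the Python sketch below returns the k catalogue entries most similar to a query string. It uses the standard library's `difflib.get_close_matches`, which ranks by a sequence-similarity ratio; that is a stand-in chosen for brevity, not a method named in the talk, and the name list is a small illustrative sample.

```python
import difflib

# Small illustrative sample of catalogue names.
NAMES = ["Panthera leo", "Panthera tigris", "Panthera pardus", "Puma concolor"]


def nearest_names(query, names, k=3, cutoff=0.6):
    """Return up to k names most similar to query, best match first.

    cutoff is the minimum similarity ratio (0..1) for a name to qualify.
    """
    return difflib.get_close_matches(query, names, n=k, cutoff=cutoff)


# A misspelled query still finds the intended entry as its closest match.
matches = nearest_names("Panthera tigriss", NAMES)
```

This scans every name on each query, so it would not scale to millions of entries; a large catalogue would need an indexed similarity-search structure.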
Setting up a Species Bank
Setting up a Species Bank
(cont’d)
• A species bank would provide not only
names, but also data about a species:
– Type
– Distribution
– Ecological role
– Phylogenetic history
– Physiology
– Genomics
• This involves issues about huge datasets.
Setting up a Species Bank
(cont’d)
• NASA earth science satellites alone beam home
image data at the rate of 1.2 terabytes a day.
• By 2010, this is expected to grow to 10
petabytes a day. (Kathleen Bergen, U. Michigan)
Name        Equal to:          Size in Bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
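Each binary unit in the table is 1,024 times the previous one, so the byte counts are successive powers of 1,024. The short Python sketch below regenerates them and compares the two daily data rates quoted for the NASA satellites; the function and variable names are illustrative.

```python
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def size_in_bytes(unit):
    """Bytes in one binary unit: 1024 raised to the unit's rank."""
    return 1024 ** (UNITS.index(unit) + 1)


# Daily rates quoted in the talk: 1.2 terabytes now, 10 petabytes projected.
daily_now = 1.2 * size_in_bytes("Terabyte")
daily_projected = 10 * size_in_bytes("Petabyte")

# How many times larger the projected rate is than the current one.
growth_factor = daily_projected / daily_now

for unit in UNITS:
    print(f"{unit:<10} {size_in_bytes(unit):,}")
```

With these figures the projected rate is several thousand times the current one.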
Setting up a Species Bank
(cont’d)
• The problem is even worse: We need to combine
information from many databases.
• There is no known way to catalogue all species
of plants in one place given current database
systems techniques.
(Jessie Kennedy, Napier University, Edinburgh)
Setting up a Species Bank
(cont’d)
• One possible approach: Tree and graph
methods to support overlapping
classifications as directed acyclic graphs
or with complex objects (taxa or
specimens) as nodes. (Jessie Kennedy)
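A minimal sketch of the directed-acyclic-graph idea in Python, assuming each taxon may be placed under more than one parent group so that competing classifications can coexist. The class, its methods, and the taxon labels are hypothetical illustrations, not a description of any existing system.

```python
from collections import defaultdict


class ClassificationGraph:
    """Taxa as nodes; an edge points from a taxon to a parent group."""

    def __init__(self):
        self.parents = defaultdict(set)

    def ancestors(self, taxon):
        """All groups reachable by following parent edges upward."""
        seen = set()
        stack = [taxon]
        while stack:
            node = stack.pop()
            for parent in self.parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_placement(self, taxon, parent):
        """Place taxon under parent, refusing edges that create a cycle."""
        if taxon == parent or taxon in self.ancestors(parent):
            raise ValueError("placement would create a cycle")
        self.parents[taxon].add(parent)


# Hypothetical labels: two classifications disagree about "Taxon X".
graph = ClassificationGraph()
graph.add_placement("Group A", "Root")
graph.add_placement("Group B", "Root")
graph.add_placement("Taxon X", "Group A")  # first classification
graph.add_placement("Taxon X", "Group B")  # overlapping classification
```

Storing parent sets rather than a single parent is what lets overlapping classifications share nodes, while the cycle check keeps the graph acyclic.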
Digitizing Natural History
Collections
• It has been estimated that there are between
1.5 and 3 billion specimens in the world’s
natural history collections, including herbaria,
living microorganism stock centers, and other
repositories (GBIF).
Digitizing Natural History
Collections (cont’d)
• If we could digitize information about these
specimens, and make it available, we
would “have a treasure trove of
information about the world’s biota.”
(GBIF)
• Pilot projects have shown that utilizing
digitized data from several institutions’
databases can be a powerful tool. (GBIF)
Digitizing Natural History
Collections (cont’d)
• Challenge: digitization and reference of
nonstandard data (photos, sonograms, field
notes)
Digitizing Natural History
Collections (cont’d)
• Challenge: Develop methods for visualizing the
data (e.g., species’ distributions)
Digitizing Natural History
Collections (cont’d)
• Challenge: Develop search engines for real-time
searching of such extremely large data sets.
Digitizing Natural History
Collections (cont’d)
• Challenge: Make information access on
the web more knowledge-based so
humans and intelligent software can work
together. (Susan Gauch, U. Kansas)
Digitizing Natural History
Collections (cont’d)
• Challenge: Use “intelligent agents” to organize
and present relevant information on the web.
(Susan Gauch)
Digitizing Natural History
Collections (cont’d)
• Challenge: Use partial information as
“training data” for classification
algorithms (Susan Gauch)
• One approach: Use training data and
classification algorithms with learning
capabilities.
(See: DIMACS project on Monitoring
Message Streams)
Digitizing Natural History
Collections (cont’d)
• Another approach to problems posed by
digitization: Use tools of “knowledge
inferencing” (Yannis Ioannidis, University of
Wisconsin)
• Still another approach: Use methods of
spatio-temporal data mining (Ioannidis; see
work of Muthukrishnan at Rutgers)
Interoperability
• Goal: Devise standards for datasets so as to
allow researchers to collaborate across
datasets; develop standards leading to
database interoperability. (GBIF)
Interoperability
• Challenge: How do we develop ways to
more accurately represent observational
or experimental data so that others may
use them? (Jessie Kennedy)
• Challenge: Deal with issues of
inconsistency and scalability.
• Challenge: Formalize issues of policy with
regard to others’ databases.
• Challenge: Interoperability over a diversity
of users and types of equipment.
Interoperability
• One approach: the “Semantic Web”, a term
for the growing desire to make information
access on the Web more knowledge-based so
humans and intelligent software can
work together. (Susan Gauch)
Interoperability
• Another approach: Make use of languages such
as XML developed to aid interoperability in
business and military collaborations.
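To make the XML idea concrete, the Python sketch below parses one specimen record with the standard library's `xml.etree.ElementTree`. The element and attribute names form a hypothetical schema invented for this example, not an existing standard, and the record contents are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; tag and attribute names are assumptions.
RECORD_XML = """
<specimen id="example-001">
  <scientificName>Puma concolor</scientificName>
  <collection>Hypothetical Museum</collection>
  <locality country="Example Country">Example locality</locality>
</specimen>
"""


def parse_specimen(xml_text):
    """Turn one <specimen> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    locality = root.find("locality")
    return {
        "id": root.get("id"),
        "scientific_name": root.findtext("scientificName"),
        "collection": root.findtext("collection"),
        "country": locality.get("country") if locality is not None else None,
    }


record = parse_specimen(RECORD_XML)
```

Two institutions that agree on a shared schema of this kind can exchange records without knowing each other's internal database layout.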
The Many Applications of
Research on the Tree of Life
• Side benefits in many fields:
– Agriculture
– Biomedicine
– Biotechnology
– Natural resource management
– Pest control
– Control of emergent diseases
– Sustainable use of biodiversity resources
– Global climate change
The Many Applications of
Research on the Tree of Life
• Let’s say you’re importing bananas from South
America.
The Many Applications of
Research on the Tree of Life
• A camera in the hold of the ship sees a spider.
• What kind of spider is it?
• Is it safe to unload your cargo of bananas?
The Many Applications of
Research on the Tree of Life
• Luckily, you have a digitized natural history
database.
• With an efficient search feature.
(Thanks to Diana Lipscomb for this example.)