
Current challenges and
opportunities in Biogrids
Dr. Craig A. Stewart
stewart@iu.edu
Director, Research and Academic Computing, University
Information Technology Services
Director, Information Technology Core, Indiana Genomics
Initiative
Visiting Scientist, Höchstleistungsrechenzentrum Universität
Stuttgart
6th Metacomputing Symposium 22 May 2003
License terms
• Please cite as: Stewart, C.A. Current challenges and opportunities in Biogrids. 2003. Presentation. Presented at: 6th Metacomputing Symposium (High Performance Computing Center, Universitaet Stuttgart, Stuttgart, Germany, 22 May 2003). Available from: http://hdl.handle.net/2022/15217
• Except where otherwise noted, by inclusion of a source URL or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share (to copy, distribute, and transmit the work) and to remix (to adapt the work) under the following condition: attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Outline
• Background about grids and biology
• Biodata grids
• Biocomputation grids
• Some comments and suggestions regarding the challenges and opportunities for the computing community and the biology community
• NB:
– Likely more questions than answers!
– “Grids” will be defined loosely, and not necessarily consistently
– Similar lack of precision will be employed with the various flavors
of “–omics.” Ultimately it’s all computational biology.
Why do subject-specific grids exist?1
• In general:
– Practical issues
– Communities of practice and trust
– Existence of specific problems that appear to call for grid-based
approaches (e.g. GriPhyN)
• In biology:
– Rudimentary “grid” projects predate the Web. Example: FlyBase via Gopher. [FlyBase dates to 1993]
– Fractionated communities
– Many independent data sources suggest a grid approach
1 These views may be peculiar to the US or to the speaker.
The revolution in biology
• Automated, high-throughput sequencing has
revolutionized biology.
• Computing has been a part of this revolution in
three ways so far:
– Computing has been essential to the assembly of genomes
– There is now so much biological data available that it is
impossible to utilize it effectively without aid of computers
– Networking and the Web have made biological data generally
and publicly available
• In the future, computing should be critical for:
– Automated data analysis
– Simulation and prediction
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Biodata Grids
So how big is big?
• GenBank has grown exponentially, but its sequences still total only ~30 billion base pairs (a quick size estimate follows this slide)
• All of the data and programs from NCBI could be fit on
one reasonably large supercomputer
• Even BIRN, the most ambitious of planned bio data grid
projects, has a data set that will grow 10s to 100s of TBs
per year
• ‘large dataset’ in the biological sciences ≠
‘large dataset’ in the physical sciences
• Complexity of linkages within the data, however…
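As a rough illustration of the point above, here is a back-of-envelope sketch in Python; the ~30 billion base pair figure comes from this slide, and the bytes-per-base encodings are assumptions made only for the estimate.

```python
# Back-of-envelope estimate of raw GenBank sequence storage, ca. 2003.
# Assumption: ~30 billion base pairs (figure from this slide), compared
# under two hypothetical encodings (1 byte/base ASCII vs. 2 bits/base).
base_pairs = 30e9

ascii_bytes = base_pairs * 1            # one character per base
packed_bytes = base_pairs * 2 / 8       # 2 bits per base, 4 bases per byte

print(f"ASCII storage : {ascii_bytes / 1e9:.0f} GB")
print(f"2-bit packed  : {packed_bytes / 1e9:.1f} GB")
```

Either way the raw sequence is on the order of tens of gigabytes, which supports the claim that “large” in biology is small by physical-science standards; the complexity lies in the linkages, not the volume.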
How many data sources?
• DNA/Chromosomes
– GenBank. Operated by NCBI (National Center for Biotechnology
Information). http://www.ncbi.nlm.nih.gov
– European Molecular Biology Laboratory – Nucleotide Sequence
Database. http://www.ebi.ac.uk/genomes
– DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp
• Proteins
– ExPASy http://www.expasy.org/
– Protein Data Bank (PDB) http://www.rcsb.org/pdb/
• Biochemistry & Enzymes
– PathDB http://www.ncgr.org/software/version_2_0.html
– KEGG, WIT http://wit.mcs.anl.gov/WIT2/
• Not to mention the organism-specific databases
The needs and opportunities in
Biodata grids
• Many disparate
subcommunities, many
funding sources, lots of
history
• NCBI, EMBL, and DDBJ contain essentially the same data; they complement and compete in terms of features and functions.
• Web clicking is not a suitable way to do large-scale computing! (A programmatic fetch is sketched after this slide.)
• Private companies may need
to be very private
http://www.ncbi.nlm.nih.gov/
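To make the “web clicking” point concrete, here is a minimal, illustrative sketch of fetching a sequence record programmatically through NCBI’s E-utilities interface; the accession number is only an example, and the service details shown (HTTPS endpoint, Python 3 urllib) postdate what was available in 2003.

```python
# Illustrative programmatic retrieval from NCBI via the E-utilities
# efetch service, instead of clicking through web pages.
# The accession number below (E. coli K-12, U00096) is only an example.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "db": "nucleotide",
    "id": "U00096",       # example accession; substitute your own
    "rettype": "fasta",
    "retmode": "text",
})
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + params

with urllib.request.urlopen(url) as response:
    fasta = response.read().decode("utf-8")

print(fasta[:200])        # header plus the first lines of sequence
```

A script like this can be looped over thousands of accessions, which is exactly what a browser-based workflow cannot do.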
Data integration and management
• Person-intensive downloads
• Avaki (http://www.avaki.com/)
• Lion Biosciences
(http://www.lionbioscience.com/)
• IBM – DB2 Information Integrator and
DiscoveryLink (www.ibm.com/)
• Various XML-based efforts
IU Centralized Life Science Database (CLSD)
• Goal set by the IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
• Based on IBM DiscoveryLink(TM) and DB2 Information Integrator(TM)
• Public data is still downloaded, parsed, and put into a
database, but now the process is automated and
centralized.
• Lab data and programs such as BLAST are included via DiscoveryLink’s wrappers (an illustrative federated query is sketched after this slide)
• Implemented in partnership with IBM Life Sciences via the IU-IBM strategic relationship in the life sciences
• IU contributed the writing of data parsers
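A minimal sketch of what a single federated query might look like from the researcher’s side, assuming a generic ODBC connection; the DSN, schema, and table/column names are hypothetical and do not reflect the actual CLSD schema or DiscoveryLink configuration.

```python
# Sketch of one SQL query spanning an internal lab table and a federated
# "nickname" for an external public source, as exposed by a federation
# layer such as DB2 Information Integrator / DiscoveryLink.
# The DSN, schema, and table/column names here are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=clsd")       # hypothetical ODBC data source name
cursor = conn.cursor()

cursor.execute("""
    SELECT lab.gene_id, lab.expression_level, pub.description
    FROM   local_lab.expression AS lab          -- internal lab data
    JOIN   genbank.sequences    AS pub          -- federated nickname for public data
           ON lab.accession = pub.accession
    WHERE  lab.expression_level > 2.0
""")

for gene_id, level, description in cursor.fetchall():
    print(gene_id, level, description)
```

The design point is that the federation layer, not the researcher, decides where each table physically lives and how it is fetched.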
Biocomputation Grids
Orders of magnitude in biology
[Figure: biological systems and modeling approaches arranged by size scale (atoms, biopolymers, cells, organisms, ecosystems and epidemiology) and timescale (10^-15 to 10^9 seconds, up to geologic and evolutionary timescales). Processes span enzyme mechanisms, protein folding, DNA replication, cell signaling, organ function, and evolutionary processes; corresponding methods include ab initio quantum chemistry, first-principles and empirical force field molecular dynamics, homology-based protein modeling, electrostatic continuum models, discrete automata models, and finite element models.]
Slide source: Rick Stevens, Argonne National Laboratory; information source DOE Genomes to Life ©
Example large-scale computational
biology grid projects
• Department of Energy “Genomes to Life”
http://doegenomestolife.org/
• Biomedical Informatics Research Network
(BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• Encyclopedia of Life (http://eol.sdsc.edu/)
Integrated Genomic Annotation Pipeline (iGAP)
[Figure: pipeline flowchart. FOLDLIB is built from structure information (SCOP, PDB): PDB chains, SCOP domains, PDP domains, and CE matches of PDB vs. SCOP, filtered to 90% sequence non-identity, minimum size 25 aa, and 90% coverage (gaps <30, ends <30). Deduced protein sequences from ~800 genomes at 10k–20k genes each (~10^7 ORFs), together with sequence information (NR, PFAM; ~10^4 entries), pass through prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG); PSI-BLAST profiles are created for the protein sequences; domains are assigned structurally by PSI-BLAST against FOLDLIB and, for sequences without an assignment, by 123D against FOLDLIB; domain locations are predicted by sequence; functional assignment uses PFAM, NR, and PSIPred; assigned regions are stored in the database. Per-stage cost estimates on the slide range from 3 to 570 CPU years (4, 228, 3, 570, 252, and 3 CPU years).]
Slide source: San Diego Supercomputer Center ©
One example: Building Phylogenetic
Trees
• Goal: an objective means by
which phylogenetic trees can be
estimated
• The number of bifurcating unrooted trees for n taxa is (2n−5)! / ((n−3)! · 2^(n−3)) (a quick calculation is sketched after this slide)
• Solution: heuristic search
• Trees built incrementally. Trees
are optimized in steps, and best
tree(s) are then kept for next
round of additions
• High communication/compute
ratio
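A quick, illustrative calculation of the formula above (the function name is mine), showing why exhaustive search is hopeless and a heuristic is required:

```python
# Number of bifurcating unrooted trees for n taxa:
#   (2n - 5)! / ((n - 3)! * 2^(n - 3)), valid for n >= 3
from math import factorial

def unrooted_tree_count(n: int) -> int:
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

for n in (4, 10, 20, 50):
    print(f"{n:3d} taxa -> {float(unrooted_tree_count(n)):.3e} trees")
```

Already at 50 taxa there are roughly 2.8 × 10^74 candidate trees, far beyond exhaustive evaluation.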
fastDNAml performance on an international Grid
[Figure: wall clock time (seconds, 0–3500) vs. number of processors (0–18) for three configurations: IU only, IU & NUS, and IU & ANU. From iGrid ’98 at SC98.]
fastDNAml performance on an IBM SP
[Figure: speedup vs. number of processors (0–70) for 50-, 101-, and 150-taxa runs, compared with perfect scaling. From Stewart et al., SC2001.]
fastDNAml and Biogrid Computing
• IU-created library called SMBL (Simple Message
Brokering Library) permits use of Condor flocks
as “worker” processes
• fastDNAml has a very high
compute/communicate ratio
• fastDNAml is one example of a general question in biogrid computation: how much of it is really capability computing, and how much would be high-throughput computing if the applications were really well written? (A minimal task-farm sketch follows.)
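The task-farm pattern referred to above, sketched minimally with Python’s multiprocessing module; this is not the SMBL or fastDNAml API, just an illustration of a high compute-to-communication workload in which workers receive a small work unit, do heavy local computation, and return a tiny result.

```python
# Illustrative task farm (not the SMBL or fastDNAml API): the master hands
# out coarse work units -- e.g., candidate trees to score -- and each worker
# returns only a small result, so computation dominates communication.
from multiprocessing import Pool

def score_tree(tree_id: int) -> tuple:
    """Stand-in for an expensive likelihood evaluation of one candidate tree."""
    likelihood = 0.0
    for i in range(1, 200_000):          # heavy local computation
        likelihood += ((tree_id * i) % 97) * 1e-6
    return tree_id, likelihood           # tiny message back to the master

if __name__ == "__main__":
    candidate_trees = range(64)          # placeholder work units
    with Pool(processes=4) as pool:
        results = pool.map(score_tree, candidate_trees)
    best = max(results, key=lambda r: r[1])
    print("best candidate:", best)
```

With coarse enough work units, the same master/worker structure maps naturally onto opportunistic workers such as a Condor flock.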
Some thoughts about the future
Current challenge areas
Problem                                     High Throughput   Grid                   Capability
Protein modeling                            X
Genome annotation, alignment,               X                 X                      x*
phylogenetics
Drug Target Screening                       X                 X (corporate grids)    X
Systems biology                                               X                      X
Medical practice support                    X                 X

*Only a few large-scale problems merit ‘capability’ status
What is the killer application for
biocomputation grids?
• Systems biology – the latest buzzword, but… (see the special issues in Nature and Science)
• Goal: multiscale modeling from cell chemistry up to
multiple populations
• Current software tools still inadequate
• Multiscale modeling calls for established HPC techniques, e.g. adaptive mesh refinement and coupled applications (a toy coupling sketch follows this slide)
• The structure of the problems matches the structure of grids
• Current challenge examples: actin fiber creation, heart
attack modeling
• Opportunity for predictive biology?
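A deliberately toy sketch of the time-scale coupling that such multiscale models require; the equations, parameters, and variable names are all invented for illustration and carry no biological meaning.

```python
# Toy illustration of coupling two time scales: a fast "cell chemistry"
# variable is sub-stepped many times inside each coarse step of a slow
# "tissue" variable that depends on it. All equations and parameters are
# invented for illustration and have no biological meaning.
fast_dt, slow_dt = 1e-4, 1e-1            # seconds; fast scale is 1000x finer
substeps = int(slow_dt / fast_dt)

chem, tissue = 1.0, 0.0                  # arbitrary initial states

for step in range(100):                  # 10 seconds of slow-scale time
    for _ in range(substeps):            # inner loop: fast dynamics
        chem += fast_dt * (tissue - chem) * 50.0
    # outer update: slow dynamics driven by the relaxed fast variable
    tissue += slow_dt * (0.2 - 0.1 * tissue + 0.05 * chem)

print(f"chem = {chem:.3f}, tissue = {tissue:.3f}")
```

In real coupled applications each loop body is a separate solver, often running on separate resources, which is exactly the kind of structure that maps onto a grid.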
Opportunities in Computational Biology
and Biomedical Research
[Image: mosquito photograph. From www.sciencemag.org/feature/data/mosquito/mtm/index.html; source library: Centers for Disease Control; photo credit: Jim Gathany]
• Bioinformatics and related
areas offer tremendous new
possibilities
• Computer-oriented biomedical
researchers must utilize the
detailed knowledge held by
“traditional” researchers
• There are tremendous
opportunities for computer
scientists and computational
scientists to find and solve
interesting and important
problems!
Some thoughts about the future of
Grids and biocomputing
• Biodata problems are largely solvable now without use of
sophisticated grid technology. This will change!
• Biocomputation grids must be developed with appropriate
technology choices. Enhancement of software must happen
simultaneously!
• Until the grid software becomes substantially simpler for the end
user, grid projects will likely continue to be based on communities of
common interest.
• There are many biodata grid and biocomputation grid opportunities
that are a good match for grid architectures. There are natural
similarities between the structure of grids and the likely structure of
significant grand challenge problems in computational biology,
biomedicine, etc.
Acknowledgments
• This research was supported in part by the Indiana Genomics
Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of
Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research
grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National Science
Foundation under Grant No. 0116050 and Grant No. CDA-9601632.
Any opinions, findings and conclusions or recommendations
expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation
(NSF).
• Particular thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS, and to Dr. Matthias Müller and Peggy Lindner for inviting me to speak here today.
Acknowledgments (cont’d)
• UITS Research and Academic Computing Division managers:
Mary Papakhian, David Hart, Stephen Simms, Richard Repasky,
Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison,
Huian Li, Jagan Lakshmipathy, David Hancock
• UITS Senior Management: Associate Vice President and Dean
Christopher Peebles, RAC(Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda Lingwall
Additional Information
• Further information is available at
– http://www.indiana.edu/~uits/rac/
– http://www.indiana.edu/~rac/staff_papers.html
– http://www.casc.org
• A recommended German bioinformatics site:
– http://www.bioinformatik.de/