Current challenges and opportunities in Biogrids
Dr. Craig A. Stewart, stewart@iu.edu
Director, Research and Academic Computing, University Information Technology Services
Director, Information Technology Core, Indiana Genomics Initiative
Visiting Scientist, Höchstleistungsrechenzentrum, Universität Stuttgart
6th Metacomputing Symposium, 22 May 2003

License terms
• Please cite as: Stewart, C.A. Current challenges and opportunities in Biogrids. 2003. Presentation. Presented at: 6th Metacomputing Symposium (High Performance Computing Center, Universitaet Stuttgart, Stuttgart, Germany, 22 May 2003). Available from: http://hdl.handle.net/2022/15217
• Except where otherwise noted, by inclusion of a source URL or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Outline
• Background about grids and biology
• Biodata grids
• Biocomputation grids
• Some comments and suggestions regarding the challenges and opportunities for the computing community and the biology community
• NB:
  – Likely more questions than answers!
  – “Grids” will be defined loosely, and not necessarily consistently
  – A similar lack of precision will be employed with the various flavors of “-omics.” Ultimately it’s all computational biology.
Why do subject-specific grids exist?¹
• In general:
  – Practical issues
  – Communities of practice and trust
  – Existence of specific problems that appear to call for grid-based approaches (e.g. GriPhyN)
• In biology:
  – Rudimentary “grid” projects predate the Web. Example: FlyBase via Gopher. [FlyBase dates to 1993]
  – Fractionated communities
  – Many independent data sources suggest a grid approach
¹ These views may be peculiar to the US or to the speaker

The revolution in biology
• Automated, high-throughput sequencing has revolutionized biology.
• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available
• In the future, computing should be critical for:
  – Automated data analysis
  – Simulation and prediction
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Biodata Grids

So how big is big?
• GenBank has grown exponentially, but its total holdings still amount to only ~30B base pairs
• All of the data and programs from NCBI could fit on one reasonably large supercomputer
• Even BIRN, the most ambitious of planned biodata grid projects, has a data set that will grow by 10s to 100s of TB per year
• ‘Large dataset’ in the biological sciences ≠ ‘large dataset’ in the physical sciences
• The complexity of linkages within the data, however…

How many data sources?
• DNA/Chromosomes
  – GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov
  – European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes
  – DNA Database of Japan (DDBJ).
http://www.ddbj.nig.ac.jp
• Proteins
  – ExPASy http://www.expasy.org/
  – Protein Data Bank (PDB) http://www.rcsb.org/pdb/
• Biochemistry & Enzymes
  – PathDB http://www.ncgr.org/software/version_2_0.html
  – KEGG; WIT http://wit.mcs.anl.gov/WIT2/
• Not to mention the organism-specific databases

The needs and opportunities in biodata grids
• Many disparate subcommunities, many funding sources, lots of history
• NCBI, DDBJ, and EMBL contain essentially the same data; they complement/compete in terms of features and functions.
• Web clicking is not a suitable way to do large-scale computing!
• Private companies may need to be very private
http://www.ncbi.nlm.nih.gov/

Data integration and management
• Person-intensive downloads
• Avaki (http://www.avaki.com/)
• Lion Bioscience (http://www.lionbioscience.com/)
• IBM – DB2 Information Integrator and DiscoveryLink (www.ibm.com/)
• Various XML-based efforts

IU Centralized Life Science Database (CSLD)
• Goal set by the IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
• Based on use of IBM DiscoveryLink™ and DB2 Information Integrator™
• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.
• Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.
• Implemented in partnership with IBM Life Sciences via the IU–IBM strategic relationship in the life sciences
• IU contributed the writing of data parsers

Biocomputation Grids

Orders of magnitude in biology
[Chart: size scale vs. timescale in biology, from atoms (timescales of 10⁻¹⁵ s) through biopolymers, cells, organisms, and ecosystems out to geologic & evolutionary timescales. Modeling approaches span the range: ab initio quantum chemistry and first-principles molecular dynamics (enzyme mechanisms), empirical force field molecular dynamics (protein folding), homology-based protein modeling, electrostatic continuum models, discrete automata models (cell signaling, DNA replication), and finite element models (organ function), up to evolutionary processes and ecosystems/epidemiology.]
Slide source: Rick Stevens, Argonne National Laboratory; information source: DOE Genomes to Life ©

Example large-scale computational biology grid projects
• Department of Energy “Genomes to Life” http://doegenomestolife.org/
• Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• Encyclopedia of Life (http://eol.sdsc.edu/)

integrated Genomic Annotation Pipeline (iGAP)
[Flowchart of the iGAP pipeline; recoverable steps and costs:]
• Inputs: structure info (SCOP, PDB); sequence info (NR, PFAM, 10⁴ entries); deduced protein sequences (~800 genomes at 10k–20k ORFs each ≈ 10⁷ ORFs)
• Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches of PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage 90% (gaps <30, ends <30)
• Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG) – 3 CPU years
• Create PSI-BLAST profiles for protein sequences – 4 CPU years
• Structural assignment of domains by PSI-BLAST on FOLDLIB – 228 CPU years
• Only for sequences without an assignment: structural assignment of domains by 123D on FOLDLIB – 570 CPU years
• Functional assignment by PFAM, NR, PSIPred – 252 CPU years
• Domain location prediction by sequence – 3 CPU years
• Store assigned regions in the DB
Slide source: San Diego Supercomputer Center ©

One example: building phylogenetic trees
• Goal: an objective means by which phylogenetic trees can be estimated
• The number of bifurcating unrooted trees for n taxa is (2n−5)! / [(n−3)! 2^(n−3)]
• Solution: heuristic search
• Trees are built incrementally: trees are optimized in steps, and the best tree(s) are then kept for the next round of additions
• High compute/communication ratio

[Chart: fastDNAml performance on an international grid — wall clock time (seconds) vs. number of processors (0–18) for runs using IU only, IU & NUS, and IU & ANU. From iGrid ’98 at SC98]
[Chart: fastDNAml performance on an IBM SP — speedup vs. number of processors (up to 70) for 50, 101, and 150 taxa, compared against perfect scaling. From Stewart et al., SC2001]

fastDNAml and biogrid computing
• An IU-created library called SMBL (Simple Message Brokering Library) permits use of Condor flocks as “worker” processes
• fastDNAml has a very high compute/communicate ratio
• fastDNAml is one example of a general phenomenon in biogrid computation: how much of it is really capability computing, and how much of it would be high-throughput computing if the applications were really well written?
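The tree-count formula above can be checked numerically. A short Python sketch (the function name is mine, for illustration) shows how quickly the search space explodes, and hence why exhaustive enumeration is hopeless and heuristic search is the only practical approach:

```python
from math import factorial

def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct bifurcating unrooted trees for n taxa:
    (2n-5)! / ((n-3)! * 2^(n-3)), i.e. the double factorial (2n-5)!!."""
    n = n_taxa
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

# The count explodes combinatorially with the number of taxa:
print(unrooted_tree_count(4))   # 3
print(unrooted_tree_count(5))   # 15
print(unrooted_tree_count(10))  # 2027025
print(unrooted_tree_count(50))  # astronomically large
```

Already at 50 taxa (the smallest problem size in the IBM SP scaling chart) the number of candidate topologies vastly exceeds what any machine could enumerate, which is why fastDNAml grows trees by stepwise addition and keeps only the best tree(s) at each round.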
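The master/worker structure that lets fastDNAml farm work out to Condor flocks can be sketched in a few lines. This is a generic illustration in Python, not SMBL’s actual API, and the scoring function is a stand-in for a real maximum-likelihood evaluation:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_tree(tree_id):
    # Stand-in worker task: in fastDNAml this would be a full
    # maximum-likelihood evaluation of one candidate tree topology.
    # Any deterministic scoring function suffices for the sketch.
    score = -float((tree_id * 2654435761) % 1000)
    return tree_id, score

def best_tree(candidate_ids):
    # Master: dispatch candidate trees to workers, collect scores,
    # keep the best. Communication happens only at dispatch and
    # collection, so the compute/communicate ratio stays high --
    # the property that lets fastDNAml-style searches run well on
    # loosely coupled resources such as Condor flocks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(evaluate_tree, candidate_ids))
    return max(results, key=lambda r: r[1])[0]
```

Because each candidate-tree evaluation is independent and long-running relative to the message traffic, the same pattern works whether the “workers” are local threads, MPI ranks, or geographically distributed grid nodes — which is exactly the question the slide poses about capability vs. high-throughput computing.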
Some thoughts about the future

Current challenge areas

  Problem                                        High Throughput   Grid   Capability
  Protein modeling                               X
  Genome annotation, alignment, phylogenetics    X                 X      x*
  Drug target screening                          X                 X      X (corporate grids)
  Systems biology                                X                 X
  Medical practice support                       X                 X

  * Only a few large-scale problems merit ‘capability’ status

What is the killer application for biocomputation grids?
• Systems biology – the latest buzzword, but… (see special issues in Nature and Science)
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools are still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• The structure of the problems matches the structure of grids
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?

Opportunities in computational biology and biomedical research
[Photo from www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source library: Centers for Disease Control; photo credit: Jim Gathany]
• Bioinformatics and related areas offer tremendous new possibilities
• Computer-oriented biomedical researchers must utilize the detailed knowledge held by “traditional” researchers
• There are tremendous opportunities for computer scientists and computational scientists to find and solve interesting and important problems!

Some thoughts about the future of Grids and biocomputing
• Biodata problems are largely solvable now without use of sophisticated grid technology. This will change!
• Biocomputation grids must be developed with appropriate technology choices. Enhancement of software must happen simultaneously!
• Until grid software becomes substantially simpler for the end user, grid projects will likely continue to be based on communities of common interest.
• There are many biodata grid and biocomputation grid opportunities that are a good match for grid architectures.
• There are natural similarities between the structure of grids and the likely structure of significant grand challenge problems in computational biology, biomedicine, etc.

Acknowledgments
• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
• Particular thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS, and to Dr. Matthias Müller and Peggy Lindner for inviting me to speak here today.

Acknowledgments, cont’d
• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• UITS senior management: Associate Vice President and Dean Christopher Peebles, RAC (Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda Lingwall

Additional information
• Further information is available at
  – http://www.indiana.edu/~uits/rac/
  – http://www.indiana.edu/~rac/staff_papers.html
  – http://www.casc.org
• A recommended German bioinformatics site:
  – http://www.bioinformatik.de/