Life Science Software and High Performance Computing
Seminar Series Part IV
Craig A. Stewart
Fulbright Senior Scholar at ZIH
Associate Vice President, Research & Academic Computing

License Terms
• Please cite this presentation as: Stewart, C.A. Life Science Software and High Performance Computing: Seminar Series Part IV. 2006. Presentation. Presented at: Technische Universitaet Dresden (Dresden, Germany, 27 Apr 2006). Available from: http://hdl.handle.net/2022/14767
• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.
• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright 2007 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Life Science Software and HPC Seminar Plan as of today
• Today:
  – Some thoughts and observations on US national projects and centers
    • Funding agencies
    • HPC/grid computing
    • Bioinformatics and computational biology
  – Performance analysis
• Late June – another visit to Dresden, associated with the ISC
• Late August – another visit to Dresden, associated with Euro-PAR

US Funding agencies (1)
• National Science Foundation: $5.5B/year annual budget; funds about 20% of all basic research in the US. Basic research in computer science, math, biology, geology, etc. www.nsf.gov
• National Institutes of Health: $27.5B/year. Funds the largest share of medical research. 27 separate institutes and centers. www.nih.gov
• Department of Energy. Funds much applied and basic research. Funds: Argonne National Laboratory, Brookhaven National Laboratory, Fermi National Accelerator Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, Sandia National Laboratories, Stanford Linear Accelerator Center, Electron accelerators, Thomas Jefferson National Accelerator Facility. www.doe.gov

US Funding agencies (2)
• Department of Defense. http://www.defenselink.mil/
  – Defense Advanced Research Projects Agency (DARPA) http://www.darpa.mil/
  – High Productivity Computing Systems program http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
• Military branches (esp. Army, Navy, Air Force)
• Department of Homeland Security http://www.dhs.gov/dhspublic/
• National Security Agency www.nsa.gov
• Congressional markups

Some shining successes
• DARPANet/Internet/Abilene
• NSF HPC Centers/NITRD
• “Hallmark” demos, e.g.
Tornado, Caterpillar bulldozer design
• It’s really possible for a good researcher to get time on a nationally shared supercomputer and get help with it

DARPA High Productivity Computing Systems program
• http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
• IBM, Cray, Sun currently phase II industry partners
• Real, not peak, performance
• http://www.darpa.mil/ipto/programs/hpcs/assessment.htm

Current Top500 list
• DOE impact on top of list!
• http://www.top500.org/lists/2005/11/basic

NSF strategies
• Office of Cyberinfrastructure. Daniel Atkins, Director
• Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. http://www.nsf.gov/publications/pub_summ.jsp?ods_key=cise051203 (aka “the Atkins Report”).
• Draft – NSF’s Cyberinfrastructure vision for the 21st century. http://www.nsf.gov/od/oci/ci_v5.pdf
• NSF Cyberinfrastructure panel
• Systems
  – $30M/year x 4 solicitations for large shared systems
  – $200M for a 1 PetaFLOPS *achieved* system
  – Focus on science results
• Software
  – National Middleware Initiative

National supercomputer centers
• Pittsburgh Supercomputing Center
• San Diego Supercomputer Center
• National Computational Science Alliance
• TeraGrid
• Other university centers of note:
  – Purdue University
  – Ohio Supercomputer Center
  – Louisiana State University
  – Texas Advanced Computing Center
  – Texas Tech
  – Rice
  – Caltech
  – Cornell
  – U. Chicago (computation, electronic visualization lab)
  – Florida/SURA

NIH
• National Center for Research Resources
• Really focused on clinical resources, not computing resources
• NIH is perhaps doing more than any other funding agency to promote openness in research as a result of its data access policies and support for open source software
• National Library of Medicine, Protein Data Bank (also supported by NSF)

A semirandom walk through some US projects

CIPRES
• Cyberinfrastructure for Phylogenetic Research (CIPRES)
• http://www.phylo.org/
• The largest active phylogenetics group going.
• “The goal of the CIPRES project is to enable large-scale phylogenetic reconstructions on a scale that will enable analyses of huge datasets containing hundreds of thousands of biomolecular sequences.”
• Have 5 years of funding.
• Computational phylogenetics activities: phylogenetic reconstruction from gene order, gene sequences. Horizontal gene transfer.

RENCI (Renaissance Computing Institute)
• http://www.renci.org/
• Led by Dan Reed. “a major collaborative venture of Duke University, North Carolina State University, the University of North Carolina at Chapel Hill and the state of North Carolina.”
• Funding through the National Middleware Initiative
• Key role in the TeraGrid

Argonne National Lab Biosciences Division
• Led by Rick Stevens. http://www.bio.anl.gov/
• LOTS of structural biology. Very focused, well funded and dedicated group.

Cal-IT2
• Led by Larry Smarr. http://www.calit2.net/
• Lots of areas of focus, including:
  – GEON: The Geosciences Network [GEON]
  – Laboratory for the Ocean Observatory Knowledge INtegration Grid [LOOKING]
  – Sensor Networks

BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human brain disorders
• Grid technology, using federated data system approach, based on Globus, SRB, etc.

Optiputer
• “The OptIPuter, so named for its use of Optical networking, Internet Protocol, computer storage, processing and visualization technologies, is an envisioned infrastructure that will tightly couple computational resources over parallel optical networks using the IP communication mechanism. The OptIPuter exploits a new world in which the central architectural element is optical networking, not computers, creating "supernetworks".”
• LambdaRAM
• http://www.optiputer.net/index.html

Genomes to Life
• http://www.doegenomestolife.org/
• Original goals:
  – Identify and Characterize the Molecular Machines of Life – the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form
  – Characterize Gene Regulatory Networks
  – Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level
  – Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior
  – (Goals taken directly from Genomes to Life web site)

Genomes to Life refactored
• The Department of Energy’s Office of Science announced ... that it is revising its plans for the deployment of new research facilities to support its Genomics:GTL program. … The specific goal of the new facilities plan will be to accelerate GTL systems biology research in the area of bioenergy, with the objective of developing cost-effective, biologically based renewable energy sources to reduce U.S. dependence on fossil fuels.
• http://www.sc.doe.gov/Sub/Newsroom/News_Releases/DOESC/2006/GTL/index.htm

Current Genomic Pipeline
[Flow diagram; source: http://eol.sdsc.edu/methodology.html]
• Inputs: sequence info (NR, PFAM); structure info (SCOP, PDB); Arabidopsis protein sequences
• Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
• Prediction of: signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Only sequences w/out A-prediction: structural assignment of domains by 123D on FOLDLIB
• Only sequences w/out A-prediction: functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Store assigned regions in the DB

Scale of Multi-genome Analysis
[Same flow diagram, annotated with data sizes and CPU-year estimates; source: http://eol.sdsc.edu/methodology.html]
• Inputs: sequence info (NR, PFAM); structure info (SCOP, PDB); genomes: ~800 genomes @ 10k-20k ORFs per genome = ~10^7 ORFs
• Building FOLDLIB (10^4 entries): PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
• Prediction of: signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG): 4 CPU years
• Create PSI-BLAST profiles for protein sequences: 228 CPU years
• Structural assignment of domains by PSI-BLAST on FOLDLIB: 3 CPU years
• Only sequences w/out A-prediction: structural assignment of domains by 123D on FOLDLIB: 9 CPU years
• Only sequences w/out A-prediction: functional assignment by PFAM, NR, PSIPred assignments: 252 CPU years
• Domain location prediction by sequence: 3 CPU years
• Store assigned regions in the DB

Other centers of note
• National Resource for Biomedical Supercomputing (NRBSC). Pittsburgh. Source of MCell. http://www.nrbsc.org/
• Scientific Computing and Imaging Institute – Christopher R.
Johnson http://www.sci.utah.edu/
• UCSD Bioinformatics Program – http://bioinformatics.ucsd.edu/
• Wash U bioinformatics http://www.ccb.wustl.edu/
• MIT, Johns Hopkins also have interesting programs
• List (incomplete) at http://zlab.bu.edu/~mfrith/BioinfoCenters.html

Some international efforts
• eScience project – http://www.nesc.ac.uk/. EDIAMOND
• Japanese Petaflops Protein Folding project http://www.jsbi.org/journal/GIW02/GIW02P121.pdf

Some activities at IU
• Flybase – authoritative source of annotated fruit fly genomic information. http://flybase.bio.indiana.edu/
• Lifescienceweb http://www.lifescienceweb.org/
  – Mutdb http://www.mutdb.org/
  – SBLEST: “The Structure-Based Local Environment Search Tool uses vectors of amino acid structural environments to perform K Nearest Neighbor queries against a database of protein structures. Our Web services allow for authenticated (password protected) submission of a protein structure, or selection of an existing structure and searching it against common databases and then visualization of the results using UCSF Chimera or PyMOL.” http://www.lifescienceweb.org/index.php?mode=sBlest_about
• TeraGrid – teragrid.iu.edu
• IU IT Strategic Plan
• IU Life Sciences Strategic Plan

Some .orgs and commercial activities
• Bioinformatics.org – Includes BioBrew Linux
• BioPerl http://www.bioperl.org/wiki/Main_Page
• Biopython http://www.biopython.org/
• BioJava http://biojava.org/wiki/Main_Page
• BioMoby http://biomoby.open-bio.org/index.php/what-is-moby/
• Bio grid activities
  – folding@home http://folding.stanford.edu/
  – Protein predictor @ home http://predictor.scripps.edu/
  – rosetta@home http://boinc.bakerlab.org/rosetta/
  – Fight aids @ home http://fightaidsathome.scripps.edu/
  – World community grid http://www.worldcommunitygrid.org/
• Commercial:
  – Apple bioclusters (uses SGE)
  – IBM Life Science Institutes of Innovation
  – Sun Center of Excellence
  – Dell Center of Excellence

Some Good Books
• Computational Cell
Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Tisdall, J. 2003. Mastering perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey (eds). 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex.

Acknowledgments
• Funding for projects described in this talk has come from the National Science Foundation, National Institutes of Health, Lilly Endowment, Inc., and the State of Indiana (particularly through support of the I-light Initiative and the 21st Century Fund).
• The work described here was made possible by the faculty, students, and staff of Indiana University. Thanks especially to the staff of RAC, CPO, Telecommunications, PTL, UITS generally, the participants in the Indiana Genomics Initiative, and the participants in the METACyt Initiative.
• Several of the slides and ideas presented here were developed by colleagues or collaborators – the Research and Academic Computing Division of UITS in general, and Dick Repasky in particular.
• Stewart’s visit to Dresden is funded in part by the Center for the International Exchange of Scholars, the Technical University of Dresden, and Indiana University.
• And thank you very much! This has been fun and educational for me!
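
Supplementary sketch: scale of the multi-genome analysis
The figures on the Scale of Multi-genome Analysis slide can be checked with a little arithmetic. The Python sketch below is not part of the original talk. It multiplies out the ORF estimate, sums the six per-step CPU-year figures from the slide, and converts the total into idealized wall-clock time. The function names, the processor counts, and the assumption of perfect parallel scaling are illustrative assumptions, not measurements from the pipeline.

```python
# Figures taken from the "Scale of Multi-genome Analysis" slide.
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)   # slide gives 10k-20k ORFs per genome
CPU_YEARS = [4, 228, 3, 9, 252, 3]   # per-step estimates, in slide order


def orf_range(genomes, per_genome):
    """Return the (low, high) total ORF count for a number of genomes."""
    low, high = per_genome
    return genomes * low, genomes * high


def wall_clock_days(cpu_years, processors):
    """Idealized wall-clock days; assumes perfect parallel scaling (hypothetical)."""
    return cpu_years * 365.0 / processors


if __name__ == "__main__":
    low, high = orf_range(GENOMES, ORFS_PER_GENOME)
    total = sum(CPU_YEARS)
    print(f"ORFs: {low:.1e} to {high:.1e}")
    print(f"Total: {total} CPU years")
    # Processor counts below are hypothetical, chosen only for illustration.
    for n in (100, 1000, 10000):
        print(f"{n:>6} processors: about {wall_clock_days(total, n):.0f} days")
```

The ORF range brackets the slide's ~10^7 estimate, and the six per-step figures sum to 499 CPU years, which is why an analysis at this scale needs a large shared system rather than a departmental cluster.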