Computational Biology: Practical lessons and thoughts for the future

Dr. Craig A. Stewart
stewart@iu.edu
Visiting Scientist, Höchstleistungsrechenzentrum, Universität Stuttgart
Director, Research and Academic Computing, University Information Technology Services
Director, Information Technology Core, Indiana Genomics Initiative
25 June 2003

License terms
• Please cite as: Stewart, C.A. Computational Biology: Practical lessons and thoughts for the future. 2003. Presentation. Presented at: Fakultät Informatik (Universität Stuttgart, Stuttgart, Germany, 25 Jun 2003). Available from: http://hdl.handle.net/2022/15218
• Except where otherwise noted, by inclusion of a source URL or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share (copy, distribute, and transmit the work) and to remix (adapt the work) under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Outline
• The revolution in biology and IU's response – the Indiana Genomics Initiative
• Example software applications
  – Centralized Life Sciences Database Service
  – fastDNAml
• What are the grand challenge problems in computational biology?
• Some thoughts about dealing with biological and biomedical researchers in general
• A brief description of IU's high performance computing, storage, and visualization environments

The revolution in biology
• Automated, high-throughput sequencing has revolutionized biology.
• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to use it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available
• http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Indiana Genomics Initiative (INGEN)
• Created by a $105M grant from the Lilly Endowment, Inc. and launched in December 2000
• Build on traditional strengths and add new areas of research for IU
• Perform the research that will generate new treatments for human disease in the postgenomic era
• Improve human health generally, and in the State of Indiana particularly
• Enhance economic growth in Indiana

INGEN Structure
• Programs:
  – Bioethics
  – Genomics
  – Bioinformatics
  – Medical Informatics
  – Education & Training
  – Tech Transfer
• Cores:
  – Gene Expression
  – Cell & Protein Expression
  – Human Expression
  – Proteomics
  – Integrated Imaging
  – In vivo Imaging
  – Animal
  – Information Technology ($6.7M)

Challenges for UITS and the INGEN IT Core
• Assist traditional biomedical researchers in adopting advanced information technology (massive data storage, visualization, and high performance computing)
• Assist bioinformatics researchers in the use of advanced computing facilities
• Questions we are asked:
  – Why wouldn't it be better just to buy me a newer PC?
• Questions we asked:
  – What do you do now with computers that you would like to do faster?
  – What would you do if computer resources were not a constraint?
So, why is this better than just buying me a new PC?
• Unique facilities provided by the IT Core:
  – Redundant data storage
  – HPC: better uniprocessor performance; trivially parallel programming, parallel programming
  – Visualization in the research laboratories
• Hardcopy document – "INGEN's advanced IT facilities: The least you need to know"
• Outreach efforts
• Demonstration projects

Example projects
• Multiple simultaneous Matlab jobs for brain imaging
• Installation of many commercial and open source bioinformatics applications
• Site licenses for several commercial packages
• Evaluation of several software products that were not implemented
• Creation of new software

Software packages from external sources
• Commercial:
  – GCG/SeqWeb
  – DiscoveryLink
  – PAUP
• Open source:
  – BLAST
  – FASTA
  – CLUSTALW
  – AutoDock
• Several programs written by UITS staff

Creation of new software
• Gamma Knife – Penelope. Modified an existing version for more precise targeting with IU's Gamma Knife.
• Karyote™ cell model. Developed a portion of the code used to model cell function. http://biodynamics.indiana.edu/
• PiVNs. Software to visualize human family trees.
• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/
• Protein Family Annotator – collaborative development with IBM
• Centralized Life Sciences Data service
• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)

Data Integration
• Goal set by the IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources, and all sources internal to the IU School of Medicine to which the researcher has read privileges
• IU has more than 1 TB of biomedical data stored in its massive data storage system
• There are many public data sources
• Different labs were independently downloading, subsetting, and formatting data
• Solution: IBM DiscoveryLink / DB2 Information Integrator

A life sciences data example: Centralized Life Science Database
• Based on IBM DiscoveryLink™ and DB2 Information Integrator™
• Public data is still downloaded, parsed, and put into a database, but the process is now automated and centralized
• Lab data and programs like BLAST are included via DiscoveryLink's wrappers
• Implemented in partnership with IBM Life Sciences via the IU–IBM strategic relationship in the life sciences
• IU contributed the writing of data parsers

A computational example – evolutionary biology
• Evolutionary trees describe how different organisms relate to each other
• This was originally done by comparison of fossils
• Statistical techniques and genomic data have made new approaches possible

fastDNAml: Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated
• The number of bifurcating unrooted trees for n taxa is (2n−5)! / (2^(n−3) (n−3)!), which grows explosively with n (see the sketch after this slide)
• Solution: heuristic search
• Trees are built incrementally. Trees are optimized in steps, and the best tree(s) are kept for the next round of additions
• High compute-to-communication ratio (the likelihood computations dominate the data exchanged)
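To make the size of that search space concrete, here is a minimal Python sketch (my own illustration, not part of the fastDNAml distribution; the function name is hypothetical) that evaluates the tree-count formula above:

```python
from math import factorial

def num_unrooted_trees(n: int) -> int:
    """Number of distinct bifurcating unrooted trees for n taxa:
    (2n-5)! / (2^(n-3) * (n-3)!), i.e. the double factorial (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (10, 20, 50):
    print(n, num_unrooted_trees(n))
# 10 taxa ->  2,027,025 trees
# 20 taxa -> ~2.2 x 10^20 trees
# 50 taxa -> ~2.8 x 10^74 trees
```

Even at 50 taxa the tree space is astronomically large, which is why fastDNAml searches heuristically rather than exhaustively, as the next slides describe.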
fastDNAml algorithm: incremental tree building
• Compute the optimal tree for three taxa (chosen randomly) – only one topology is possible
• Randomly pick another taxon, and consider each of the 2i−5 trees possible when this taxon is added to the current tree (for the first addition, the three-taxon tree)
• Keep the best (maximum likelihood) tree

fastDNAml algorithm: branch rearrangement
• Local branch rearrangement: move any subtree crossing n vertices (if n = 1 there are 2i−6 possibilities)
• Keep the best resulting tree
• Repeat this step until local swapping no longer improves the likelihood value

Because of local effects…
• Where you end up sometimes depends on where you start
• This process searches a huge space of possible trees, and is thus dependent upon the randomly selected initial taxa
• It can get stuck in a local optimum, rather than the global one
• One must do multiple runs with different randomizations of taxon entry order, and compare the results
• Similar trees and likelihood values provide some confidence, but the space of all possible trees still has not been searched extensively

fastDNAml parallel algorithm
[Figure: schematic of the parallel fastDNAml algorithm; an illustrative sketch follows the performance figure below]

fastDNAml Performance on IBM SP
[Figure: speedup vs. number of processors (up to ~64) for 50-, 101-, and 150-taxon runs, compared against perfect scaling. From Stewart et al., SC2001]
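The parallel algorithm farms the many independent likelihood evaluations of candidate topologies out to worker processes. The following mpi4py sketch illustrates that master/worker pattern under my own assumptions: evaluate_likelihood and the tuple-encoded topologies are placeholders, and the production fastDNAml (a C program) organizes its processes differently in detail.

```python
# Illustrative master/worker distribution of candidate-topology likelihood
# evaluations, in the spirit of parallel fastDNAml (mpi4py sketch).
from mpi4py import MPI

def evaluate_likelihood(topology):
    # Placeholder: the real code optimizes branch lengths and returns ln L.
    return -float(sum(topology))

def master(comm, topologies):
    nworkers = comm.Get_size() - 1
    status = MPI.Status()
    best, best_score = None, float("-inf")
    pending = list(enumerate(topologies))
    active = 0
    # Prime each worker with one candidate topology.
    for w in range(1, nworkers + 1):
        if pending:
            comm.send(pending.pop(), dest=w)
            active += 1
    # Collect scores; hand out remaining candidates as workers free up.
    while active:
        score, idx = comm.recv(source=MPI.ANY_SOURCE, status=status)
        active -= 1
        if score > best_score:
            best_score, best = score, topologies[idx]
        if pending:
            comm.send(pending.pop(), dest=status.Get_source())
            active += 1
    for w in range(1, nworkers + 1):
        comm.send(None, dest=w)  # shutdown signal
    return best, best_score

def worker(comm):
    while True:
        task = comm.recv(source=0)
        if task is None:
            return
        idx, topology = task
        comm.send((evaluate_likelihood(topology), idx), dest=0)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        # Hypothetical candidates for one round of taxon addition.
        candidates = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]
        print(master(comm, candidates))
    else:
        worker(comm)
```

Run with, e.g., `mpirun -np 4 python sketch.py`. Because each likelihood evaluation is expensive relative to the few bytes exchanged, this pattern keeps workers busy, which is consistent with the near-linear speedups in the figure above.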
Other grand challenge problems and some thoughts about the future

Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with the IU School of Medicine to adapt the Penelope code to work with a detailed model of an individual patient's head

"Simulation-only" studies
• Aquaporins: proteins which conduct large volumes of water through cell membranes while filtering out charged particles such as hydrogen ions.
• Massive simulation showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Halfway through, the water molecule flips over; that breaks the "proton wire."
• Work done at PSC. Klaus Schulten et al., U. of Illinois, Science (April 19, 2002); 35,000 hours on the TCS.

Integrated Genomic Annotation Pipeline (iGAP)
• Structure info: SCOP, PDB. Sequence info: NR, PFAM (~10^4 entries)
• Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches (PDB vs. SCOP); 90% sequence non-identical; minimum size 25 aa; coverage 90% (gaps < 30, ends < 30)
• Deduced protein sequences: ~800 genomes at 10k–20k each ≈ 10^7 ORFs
• Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG) – 4 CPU years
• Create PSI-BLAST profiles for protein sequences – 228 CPU years
• Structural assignment of domains by PSI-BLAST on FOLDLIB – 3 CPU years (only sequences without an assignment are passed on)
• Structural assignment of domains by 123D on FOLDLIB – 570 CPU years (only sequences without an assignment are passed on)
• Functional assignment by PFAM, NR, PSIPred – 252 CPU years
• Domain location prediction by sequence – 3 CPU years
• Store assigned regions in the database
• Slide source: San Diego Supercomputing Center

Drug Design
• Protein folding "the right way":
  – Homology modeling
  – Then adjust for side-chain variations, etc.
• Drug screening:
  – Target generation – so what
  – Target verification – that's important!
  – Toxicity prediction – VERY important

What is the killer application in computational biology?
• Systems biology – the latest buzzword, but… (see the special issues of Nature and Science)
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools are still inadequate
• Multiscale modeling calls for established HPC techniques, e.g., adaptive mesh refinement and coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• An opportunity for predictive biology?

Current challenge areas

Problem                                       High Throughput   Grid   Capability
Protein modeling                                     X
Genome annotation, alignment, phylogenetics          X             X      x*
Drug target screening                                X             X      X (corporate grids)
Systems biology                                                    X      X
Medical practice support                             X             X

*Only a few large-scale problems merit 'capability' status

Other example large-scale computational biology grid projects
• Department of Energy "Genomes to Life": http://doegenomestolife.org/
• Biomedical Informatics Research Network (BIRN): http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid: http://www.apbionet.org/
• Encyclopedia of Life: http://eol.sdsc.edu/

Thoughts about working with biologists

Bioinformatics and Biomedical Research
• Bioinformatics, genomics, proteomics, ____ics will radically change the understanding of biological function and the way biomedical research is done
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers
[Image: Anopheles gambiae. From www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source library: Centers for Disease Control; photo credit: Jim Gathany]

INGEN IT Status Overall
• So far, so good
• 108 users of IU's supercomputers
• 104 users of the massive data storage system
• Six new software packages created or enhanced; more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open source software as a direct result of INGEN. Opportunities for tech transfer due to use of the GNU Lesser General Public License.
• The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc.
• Work on the Penelope code for the Gamma Knife is likely to be the first major transferable technology development. It stands to improve the efficacy of Gamma Knife treatment at IU.

So how do you find biologists with whom to collaborate?
• A chicken-and-egg problem?
• Or more like fishing?
• Or bank robbery?

Bank robbery
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said "because that's where the money is."*
• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might have when working with an aerospace design firm
  – Patience
• There are lots of opportunities open to HPC centers willing to make the effort to cultivate relationships with biologists and biomedical researchers. To do this, we'll all have to spend a bit of time "going where the biologists are."
*Unfortunately this is an urban legend; Sutton never said this

Some information about the Indiana University high performance computing environment

Networking: I-light
• Network jointly owned by Indiana University and Purdue University
• 36 fibers between Bloomington and Indianapolis (IU's main campuses)
• 24 fibers between Indianapolis and West Lafayette (Purdue's main campus)
• Co-location with the Abilene GigaPoP
• Expansion to other universities recently funded

Sun E10000 (Solar)
• Acquired 4/00
• Shared memory architecture; ~52 GFLOPS
• 64 400 MHz CPUs, 64 GB memory, > 2 TB external disk
• Supports some bioinformatics software available only (or primarily) under Solaris (e.g., GCG/SeqWeb)
• Used extensively by researchers working with large databases (database performance, cheminformatics, knowledge management)
[Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.]
IBM Research SP (Aries/Orion complex)
• 632 CPUs, 1.005 TFLOPS. First university-owned supercomputer in the US to exceed 1 TFLOPS aggregate peak theoretical processing capacity.
• Geographically distributed between IUB and IUPUI
• Initially 50th, now 170th on the Top 500 supercomputer list
• Distributed memory system with shared memory nodes
[Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.]

AVIDD (Analysis and Visualization of Instrument-Driven Data)
• Project funded largely by the National Science Foundation (NSF), with funds from Indiana University and a Shared University Research grant from IBM
• Hardware components:
  – Distributed Linux cluster
    • Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington
    • 2.164 TFLOPS, 0.5 TB RAM, 10 TB disk
    • Tuned, configured, and optimized for handling real-time data streams
  – A suite of distributed visualization environments
  – Massive data storage
• Usage components:
  – Research by application scientists
  – Research by computer scientists
  – Education

Goals for AVIDD
• Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)
• Focused on modern instruments that produce data in digital format at high rates. Example instruments:
  – Advanced Photon Source, Advanced Light Source
  – Atmospheric science instruments in forests
  – Gene sequencers, expression chip readers

Goals for AVIDD, cont'd
• Performance goals:
  – Two researchers should be able to analyze 1 TB data sets simultaneously (with other, smaller jobs still running)
  – The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
  – It should be possible to move 1 TB of data from the HPSS disk cache into the cluster in ~2 hours (roughly 140 MB/s sustained)
• Science goals:
  – The distribution of 3D visualization environments into scientists' labs should enhance scientists' ability to interact spontaneously with their data
  – The ability to manage large data sets should no longer be an obstacle to scientific research
  – AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research

Real-time pre-emption of jobs
• Goal: a high overall rate of utilization, while remaining able to respond "immediately" to requests for real-time data analysis
• System design:
  – Maui Scheduler: support multiple QoS levels for jobs
  – PBS Pro: support multiple QoS levels, and provide signaling for job termination, job suspension, and job checkpointing
  – LAM/MPI and Red Hat: kernel-level checkpointing
• Options to be supported (see the sketch after this list):
  – Cancel and terminate job
  – Requeue job
  – Signal, wait, and requeue job
  – Checkpoint job (as available)
  – Signal job (used to suspend/resume, i.e., SIGSTOP/SIGCONT)
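To show what "signal, wait, and requeue" and checkpointing mean from an application's point of view, here is a minimal Python sketch of a cooperative job. The choice of SIGUSR1 and the checkpoint file name are my assumptions, not AVIDD's actual scheduler configuration; note also that SIGSTOP cannot be caught by a process, which is why suspension is left to the kernel while cooperative checkpointing needs a catchable signal.

```python
# Minimal sketch: a long-running job that writes a checkpoint when the
# scheduler signals it, so it can be requeued and later resumed.
import json, os, signal, sys

CHECKPOINT = "state.json"          # hypothetical checkpoint file name
checkpoint_requested = False

def request_checkpoint(signum, frame):
    # Only set a flag here; do the real work at a safe point in the loop.
    global checkpoint_requested
    checkpoint_requested = True

signal.signal(signal.SIGUSR1, request_checkpoint)  # assumed pre-emption signal

# Resume from a previous checkpoint if one exists.
step = 0
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        step = json.load(f)["step"]

total = 1_000_000
while step < total:
    step += 1                      # stand-in for one unit of real computation
    if checkpoint_requested:
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step}, f)
        sys.exit(0)                # exit cleanly; the scheduler requeues the job
print("done")
```

Deferring the checkpoint to a safe point in the main loop, rather than writing it inside the signal handler, keeps the saved state consistent; the batch system then only has to deliver the agreed-upon signal, wait, and requeue.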
1 TFLOPS achieved on Linpack!
• AVIDD-I and AVIDD-B together have a peak theoretical capacity of 1.997 TFLOPS
• We have just achieved 1.02 TFLOPS on the Linpack benchmark for this distributed system
• 51st place on the current Top500 list; the highest-ranked distributed Linux cluster
• Details:
  – Force10 switches; non-routing 20 GB/sec network connecting AVIDD-I and AVIDD-B (~90 km distance)
  – LINPACK implementation from the University of Tennessee called HPL (High Performance LINPACK), version 1.0 (http://www.netlib.org/benchmark/hpl/). The problem size we used was 220,000, with a block size of 200.
  – LAM/MPI 6.6 beta development version (3/23/2003)
  – Tuning: block size (optimized for smaller matrices, and then it seemed to continue to work well), increased the default frame size for communications, fiddled with the number of systems used, and rebooted the entire system just before running the benchmark (!)

Cost of grid computing on performance
• Each of the two clusters alone achieved 682.5 GFLOPS, or 68% of the 998.4 GFLOPS peak theoretical capacity per cluster
• The aggregate distributed cluster achieved 1.02 TFLOPS out of 1.997 TFLOPS, or 51% of peak theoretical capacity

Massive Data Storage System
• Based on HPSS (High Performance Storage System)
• First HPSS installation with distributed movers; STK 9310 silos in Bloomington and Indianapolis
• Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.
• 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB of biomedical data.
• Common File System (CFS) – disk storage "for the masses"
[Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.]

John-E-Box
• Invented by John N. Huffman, John C. Huffman, and Eric Wernert

Acknowledgments
• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research grants from IBM to Indiana University.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Acknowledgments, cont'd
• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• Assistance with this presentation: John Herrin, Malinda Lingwall
• Thanks to Dr. M. Resch, Director, HLRS, for inviting me to visit HLRS
• Thanks to Dr. H. Bungartz for his hospitality, help, and for including "Einführung in die Bioinformatik" (Introduction to Bioinformatics) as an elective
• Thanks to Dr. S. Zimmer for help throughout the semester
• Further information is available at:
  – ingen.iu.edu
  – http://www.indiana.edu/~uits/rac/
  – http://www.ncsc.org/casc/paper.html
  – http://www.indiana.edu/~rac/staff_papers.html
• A recommended German bioinformatics site: http://www.bioinformatik.de/
• Paper coming soon for the SIGUCCS conference, Oct. 2003